Preface


In the last 15 years more than 3000 articles on simulation and the Monte 
Carlo method have been published. There is real need for a book providing 
detailed treatment of the statistical aspect of these topics. This book 
attempts to fill this need, at least partially. I hope it will make the users of 
simulation and the Monte Carlo method more knowledgeable about these 
topics. 

It is assumed that the readers are familiar with the basic concepts of 
probability theory, mathematical statistics, integral and differential equa- 
tions, and that they have an elementary knowledge of vector and matrix 
operators. Sections 6.5, 6.6, 7.3, and 7.6 require more sophistication in 
probability, statistics, and stochastic processes; they can be omitted for a 
first reading. 

Since most complex simulations are implemented on digital computers, a 
rudimentary acquaintance with computer programming will probably be 
an asset to the readers of this book, though no computer programs are 
included. 

Chapter 1 describes concepts such as systems, models, and the ideas of 
Monte Carlo and simulation. A discussion of these concepts seems neces- 
sary as there is no uniform terminology in the literature. Instead of giving 
rigid definitions, I try to make clear what I mean when I use these terms.
In addition to the terminology, some examples and ideas of simulation and 
Monte Carlo methods are given. 

Chapter 2 deals with several alternative methods for generating random 
and pseudorandom numbers on a computer, as well as several statistical 
methods for testing the “randomness” of pseudorandom numbers.

Chapter 3 describes methods for generating random variables and ran- 
dom vectors from different probability distributions. 

Chapter 4 provides a basic treatment of Monte Carlo integration, and 
Chapter 5 provides a solution of linear, integral, and differential equations 
by Monte Carlo methods. It is shown that, in order to find a solution by 
Monte Carlo methods, we must choose a proper distribution and present 
the problem in terms of its expected value. Then, taking a sample from this
distribution, we can estimate the expected value. In addition, variance
reduction techniques (importance sampling, control variates, stratified
sampling, antithetic variates, etc.) are discussed.

Chapter 6 deals with simulating regenerative processes and in particular 
with estimating some output parameters of the steady-state distribution 
associated with these processes. Simulation results for several practical 
problems are presented, and variance reduction techniques are given as 
well. 

Chapter 7 discusses random search methods, which are also related to 
Monte Carlo methods. In this chapter I describe how random search 
methods can be successfully applied for solving complex optimization 
problems. 

The final version of this book was written during my 1980 summer visit 
at IBM Thomas J. Watson Research Center. I express my gratitude to the 
Computer Sciences Department for their hospitality and for providing a 
rich intellectual environment. 

A number of people have contributed corrections and suggestions for 
improvement of the earlier draft of the manuscript, especially P. Feigin, 
I. Kreimer, O. Maimon, H. Nafetz, G. Samorodnitsky, and E. Yaschin 
from Technion, Israel Institute of Technology, and P. Heidelberger and S. 
Lavenberg of IBM Thomas J. Watson Research Center. It is a pleasure to 
acknowledge my debt to them. I would also like to express my indebted- 
ness to Beatrice Shube of John Wiley & Sons and to Eliezer Goldberg of 
Technion for their efficient editorial guidance. Many thanks to Marylou 
Dietrich of IBM and to Eva Gaster of Technion for their excellent typing. 

Finally, I thank the following authors and publishers for granting 
permission for publication of the cited material: 

Pages 12-17 based on Handbook of Operations Research: Foundations and
Fundamentals, edited by Joseph T. Moder and Salah E. Elmaghraby, Van
Nostrand Reinhold Company, 1978, pp. 570-573.

Pages 23-25 based on D. E. Knuth, The Art of Computer Programming:
Seminumerical Algorithms, Vol. 2, Addison-Wesley, Reading, Massachusetts,
1969, pp. 155-156.

Pages 199-208 based on Y. R. Rubinstein, Selecting the best stable
stochastic system, in Stochastic Processes and their Applications, 1980 (to
appear).

Pages 253-255 based on Y. R. Rubinstein and I. Weisman, The Monte
Carlo method for global optimization, Cahiers du Centre d'Études de
Recherche Opérationnelle, 21, No. 2, 1979, pp. 143-149.
CHAPTER 1

Systems, Models, 
Simulation, and the 
Monte Carlo Methods 


In this chapter we discuss the concepts of systems, models, simulation and 
Monte Carlo methods. This discussion seems necessary in the absence of a 
unified terminology in the literature. We do not give rigid definitions, 
however, but explain what we mean when using the above-mentioned 
terms. 


1.1 SYSTEMS 

By a system we mean a set of related entities sometimes called components 
or elements . For instance, a hospital can be considered as a system, with 
doctors, nurses, and patients as elements. The elements have certain 
characteristics, or attributes , that have logical or numerical values. In our 
example an attribute can be, for instance, the number of beds, the number 
of X-ray machines, skill, quantity, and so on. A number of activities 
(relations) exist among the elements, and consequently the elements inter- 
act. These activities cause changes in the system. For example, the hospital 
has X-ray machines that have an operator. If there is no operator, the 
doctors cannot have X-rays of the patients taken. 

We consider both internal and external relationships. The internal 
relationships connect the elements within the system, while the external 
relationships connect the elements with the environment, that is, with the 
world outside the system. For instance, an internal relationship is the 
relationship or interaction between the doctors and nurses, or between
the nurses and the patients. An external relationship is, for example, the
way in which the patients are delivered to the emergency room. We can
represent a system by a diagram, as in Fig. 1.1.1.

Fig. 1.1.1 Graphical representation of a system.

The system is influenced by the environment through the input it 
receives from the environment. When a system has the capability of 
reacting to changes in its own state, we say that the system contains 
feedback. A nonfeedback, or open-loop, system lacks this characteristic. 
For an example of feedback consider a waiting line; when there are more 
than a certain number of patients, the hospital can add more staff to 
handle the increased workload. 

The attributes of the system elements define its state. In our example the
number of patients waiting for a doctor describes the system’s state. When
a patient arrives at or leaves the hospital, the system moves to a new state. 
If the behavior of the elements cannot be predicted exactly, it is useful to 
take random observations from the probability distributions and to aver- 
age the performance of the objective. We say that a system is in equilibrium 
or in the steady state if the probability of being in some state does not vary 
in time. There are still actions in the system, that is, the system can still 
move from one state to another, but the probabilities of its moving from 
one state to another are fixed. These fixed probabilities are limiting 
probabilities that are realized after a long period of time, and they are 
independent of the state in which the system started. A system is called 
stable if it returns to the steady state after an external shock in the system. 
If the system is not in the steady state, it is in a transient state. 
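
The steady-state idea can be illustrated with a minimal sketch. We assume a hypothetical two-state system (state 0 = "no patients waiting", state 1 = "patients waiting") whose transition probabilities are invented for illustration and are not taken from the text. Repeatedly applying the fixed transition probabilities drives the state probabilities to the same limit from either starting state.

```python
def step(dist, P):
    """Advance a probability distribution over states by one transition."""
    n = len(dist)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def long_run(dist, P, steps=200):
    """Apply the fixed transition probabilities many times."""
    for _ in range(steps):
        dist = step(dist, P)
    return dist


# Hypothetical transition probabilities (rows sum to 1); an assumption for
# illustration only.
P = [[0.9, 0.1],
     [0.5, 0.5]]

from_empty = long_run([1.0, 0.0], P)  # start with no patients waiting
from_queue = long_run([0.0, 1.0], P)  # start with patients waiting

print(from_empty)
print(from_queue)
```

For this particular matrix the limiting probabilities are 5/6 and 1/6, whichever state the system starts in. The early iterations, before the probabilities settle, correspond to the transient state.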

We can classify systems in a variety of ways. There are natural and 
artificial systems , adaptive and nonadaptive systems . An adaptive system 
reacts to changes in its environment, whereas a nonadaptive system does 
not. Analysis of an adaptive system requires a description of how the 
environment induces a change of state. 

Suppose that over a period of time the number of patients increases. If 
the hospital adds more staff to handle the increased workload, we say that 
the hospital is an adaptive system. 
1.2 MODELS

The first step in studying a system is building a model. The importance of 
models and model-building has been discussed by Rosenbluth and Wiener 
[32], who wrote: 

No substantial part of the universe is so simple that it can be grasped and
controlled without abstraction. Abstraction consists in replacing the part of
the universe under consideration by a model of similar but simpler structure.
Models ... are thus a central necessity of scientific procedure.

A scientific model can be defined as an abstraction of some real system, an 
abstraction that can be used for prediction and control. The purpose of a 
scientific model is to enable the analyst to determine how one or more 
changes in various aspects of the modeled system may affect other aspects 
of the system or the system as a whole. 

A crucial step in building the model is constructing the objective 
function, which is a mathematical function of the decision variables. 

There are many types of models. Churchman et al. [4] and Kiviat [18] 
described the following kinds: 

1 Iconic models Those that pictorially or visually represent certain 
aspects of a system. 

2 Analog models Those that employ one set of properties to represent 
some other set of properties that the system being studied possesses. 

3 Symbolic models Those that require mathematical or logical opera- 
tions and can be used to formulate a solution to the problem at hand. 

In this book, however, we are concerned only with symbolic models 
(which are also called abstract models), that is, we deal with models 
consisting of mathematical symbols or flowcharts. All other models (iconic, 
analog, verbal, physical, etc.), although no less important, are excluded 
from this book. 

There are many advantages to using mathematical models. According
to Fishman [8] they do the following: 

1 Enable investigators to organize their theoretical beliefs and empiri- 
cal observations about a system and to deduce the logical implications of 
this organization. 

2 Lead to improved system understanding. 

3 Bring into perspective the need for detail and relevance. 

4 Expedite the analysis. 

5 Provide a framework for testing the desirability of system modifica- 
tions. 
6 Allow for easier manipulation than the system itself permits.

7 Permit control over more sources of variation than direct study of a 
system would allow. 

8 Are generally less costly than the system. 

An additional advantage is that a mathematical model describes a 
problem more concisely than, for instance, a verbal description does. 

On the other hand, there are at least three reservations in Fishman’s 
monograph [8], which we should always bear in mind while constructing a 
model. 

First, there is no guarantee that the time and effort devoted to modeling 
will return a useful result and satisfactory benefits. Occasional failures 
occur because the level of resources is too low. More often, however, 
failure results when the investigator relies too much on method and not
enough on ingenuity; the proper balance between the two leads to the 
greatest probability of success. 

The second reservation concerns the tendency of an investigator to treat 
his or her particular depiction of a problem as the best representation of 
reality. This is often the case after much time and effort have been spent 
and the investigator expects some useful results. 

The third reservation concerns the use of the model to predict the range 
of its applicability without proper qualification. 

Mathematical models can be classified in many ways. Some models are 
static, others are dynamic. Static models are those that do not explicitly take
time-variation into account, whereas dynamic models deal explicitly with
time-variable interaction. For instance, Ohm’s law is an example of a static
model, while Newton’s law of motion is an example of a dynamic model. 

Another distinction concerns deterministic versus stochastic models. In a 
deterministic model all mathematical and logical relationships between the 
elements are fixed. As a consequence these relationships completely de- 
termine the solutions. In a stochastic model at least one variable is 
random. 

While building a model care must be taken to ensure that it remains a 
valid representation of the problem. 

In order to be useful, a scientific model necessarily embodies elements of 
two conflicting attributes — realism and simplicity. On the one hand, the 
model should serve as a reasonably close approximation to the real system 
and incorporate most of the important aspects of the system. On the other 
hand, the model must not be so complex that it is impossible to understand 
and manipulate. Being a formalism, a model is necessarily an abstraction. 

Often we think that the more details a model includes the better it 
resembles reality. But adding details makes the solution more difficult and 
converts the method for solving a problem from an analytical to an
approximate numerical one. 

In addition, it is not even necessary for the model to approximate the 
system to indicate the measure of effectiveness for all various alternatives. 
All that is required is that there be a high correlation between the predic- 
tion by the model and what would actually happen with the real system. 
To ascertain whether this requirement is satisfied or not, it is important to 
test and establish control over the solution. 

Usually, we begin testing the model by re-examining the formulation of 
the problem and revealing possible flaws. Another criterion for judging the
validity of the model is determining whether all mathematical expressions 
are dimensionally consistent. A third useful test consists of varying input 
parameters and checking that the output from the model behaves in a 
plausible manner. The fourth test is the so-called retrospective test. It 
involves using historical data to reconstruct the past and then determining 
how well the resulting solution would have performed if it had been used. 
Comparing the effectiveness of this hypothetical performance with what 
actually happened then indicates how well the model predicts the reality. 
However, a disadvantage of retrospective testing is that it uses the same 
data that guided formulation of the model. Unless the past is a true replica 
of the future, it is better not to resort to this test at all. 

Suppose that the conditions under which the model was built change. In 
this case the model must be modified and control over the solution must 
be established. Often, it is desirable to identify the critical input parameters 
of the model, that is, those parameters subject to changes that would affect 
the solution, and to establish systematic procedures to control them. This 
can be done by sensitivity analysis , in which the respective parameters are 
varied over their ranges to determine the degree of variation in the solution 
of the model. 
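
A sensitivity analysis of this kind can be sketched as follows. The model used here, the mean number in system of an M/M/1 queue, and the parameter ranges are assumptions chosen only for illustration. The helper `sensitivity` is a hypothetical name. It varies one parameter at a time over its range and reports the spread of the model output.

```python
def mean_in_system(arrival_rate, service_rate):
    """Illustrative model: mean number in an M/M/1 queue, rho / (1 - rho)."""
    rho = arrival_rate / service_rate
    if rho >= 1:
        raise ValueError("queue is unstable when arrival_rate >= service_rate")
    return rho / (1 - rho)


def sensitivity(model, base, ranges, points=5):
    """Vary one parameter at a time over its range; return the output spread."""
    spread = {}
    for name, (lo, hi) in ranges.items():
        outputs = []
        for k in range(points):
            value = lo + (hi - lo) * k / (points - 1)
            params = dict(base, **{name: value})  # all other parameters at base
            outputs.append(model(**params))
        spread[name] = max(outputs) - min(outputs)
    return spread


# Hypothetical base values and ranges (customers per hour).
base = {"arrival_rate": 8.0, "service_rate": 10.0}
ranges = {"arrival_rate": (7.0, 9.0), "service_rate": (9.0, 11.0)}

spread = sensitivity(mean_in_system, base, ranges)
print(spread)
```

The parameter with the larger spread is the more critical one over the ranges considered. In this sketch it is the arrival rate.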

After constructing a mathematical model for the problem under consid- 
eration, the next step is to derive a solution from this model. There are 
analytic and numerical solution methods. 

An analytic solution is usually obtained directly from its mathematical
representation in the form of a formula.

A numerical solution is generally an approximate solution obtained as a 
result of substitution of numerical values for the variables and parameters 
of the model. Many numerical methods are iterative, that is, each succes- 
sive step in the solution uses the results from the previous step. Newton’s 
method for approximating the root of a nonlinear equation can serve as an 
example. 
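
As a minimal sketch of such an iterative scheme, Newton's method can be written in a few lines. The test equation x^2 - 2 = 0, the starting point, and the tolerance are assumptions chosen for illustration.

```python
def newton(f, df, x0, tol=1e-12, max_iter=50):
    """Approximate a root of f by Newton's iteration x <- x - f(x) / f'(x)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)  # uses the result of the previous step
        x -= step
        if abs(step) < tol:
            return x
    raise RuntimeError("Newton's method did not converge")


# Illustrative equation: x^2 - 2 = 0, whose positive root is sqrt(2).
root = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
print(root)
```

Each pass through the loop uses only the value produced by the previous pass, which is the sense in which the method is iterative.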

Two special types of numerical methods are simulation and the Monte 
Carlo methods. The following section discusses these. 
1.3 SIMULATION AND THE MONTE CARLO METHODS

Simulation has long been an important tool of designers, whether they are 
simulating a supersonic jet flight, a telephone communication system, a 
wind tunnel, a large-scale military battle (to evaluate defensive or offensive 
weapon systems), or a maintenance operation (to determine the optimal 
size of repair crews). 

Although simulation is often viewed as a “method of last resort” to be 
employed when everything else has failed, recent advances in simulation 
methodologies, availability of software, and technical developments have 
made simulation one of the most widely used and accepted tools in system 
analysis and operations research. 

Naylor et al. [28] define simulation as follows: 

Simulation is a numerical technique for conducting experiments on a digital 
computer, which involves certain types of mathematical and logical models 
that describe the behavior of business or economic system (or some com- 
ponent thereof) over extended periods of real time. 

This definition is extremely broad, however, and can include such 
seemingly unrelated things as economic models, wind tunnel testing of 
aircraft, war games, and business management games. 

Naylor et al. [28] write: 

The fundamental rationale for using simulation is man's unceasing quest for 
knowledge about the future. This search for knowledge and the desire to 
predict the future are as old as the history of mankind. But prior to the
seventeenth century the pursuit of predictive power was limited almost 
entirely to the purely deductive methods of such philosophers as Plato, 
Aristotle, Euclid, and others. 

Simulation deals with both abstract and physical models. Some simula- 
tion with physical and abstract models might involve participation by real 
people. Examples include link-trainers for pilots and military or business 
games. Two types of simulation involving real people deserve special 
mention. One is operational gaming , the other man-machine simulation. 

The term “operational gaming” refers to those simulations characterized 
by some form of conflict of interest among players or human decision- 
makers within the framework of the simulated environment, and the 
experimenter, by observing the players, may be able to test hypotheses 
concerning the behavior of the individuals and/or the decision system as a 
whole. 
In operational gaming a computer is often used to collect, process, and
produce information that human players, usually adversaries, need to 
make decisions about system operation. Each player’s objective is to 
perform as well as possible. Moreover, each player’s decisions affect the 
information that the computer provides as the game progresses through 
simulated time. The computer can also play an active role by initiating 
predetermined or random actions to which the players respond. 

War games and business management games are commonly discussed in 
operational gaming literature (see, e.g., Morgenthaler [23] and Shubik [38]).

Military gaming is essentially a training device for military leaders; it 
enables them to test the effects of alternative strategies under simulated 
war conditions. For example, the Naval Electronic Warfare Simulator, 
developed in the 1950s, consisted of a large analog computer designed 
primarily to assess ship damage and to provide information to two oppo- 
site forces regarding their respective effectiveness in a naval engagement 
[14, pp. 15, 16]. The exercise, which is one form of simulation gaming, has 
been used as an educational device for naval fleet officers in the final 
stages of their training. 

Business games are also a type of educational tool, but for training 
managers or business executives rather than military leaders. 


A business game is a contrived situation which imbeds players in a simulated 
business environment, where they must make management- type decisions 
from time to time, and their choices at one time generally affect the 
environmental conditions under which subsequent decisions must be made. 
Further, the interaction between decisions and environment is determined by 
a refereeing process which is not open to argument from the players [30, pp.
7, 8].

In man-machine simulation there is no need for gaming. While interacting
with the computer, real people in the laboratory perform the data reduction
and analysis.

The following two examples are drawn from Fishman [8]:

The Rand Systems Research Laboratory employed simulation to generate
stimuli for the study of information processing centers [14, p. 16]. The
principal features of a radar site were reproduced in the laboratory, and by 
carefully controlling the synthetic input to the system and recording the 
behavior of the human detectors it was possible to examine the relative 
effectiveness of various man-machine combinations and procedures. 

In 1956 Rand established the Logistics System Laboratory under U.S. 
Air Force sponsorship [10]. The first study in this laboratory involved 
simulation of two large logistics systems in order to compare their effec-
tiveness under different management and resource utilization policies. 
Each system consisted of men and machines, together with policy rules for 
the use of such resources in simulated stress situations such as war. The 
simulated environment required a specified number of aircraft in flying 
and alert states, while the system’s capability to meet these objectives was 
limited by malfunctioning parts, procurement and transportation delays, 
and the like. The human participants represented management personnel, 
while higher echelon policies in the utilization of resources were simulated 
on the computer. The ultimate criteria of the effectiveness of each system 
were the number of operationally ready aircraft and the dollar cost of 
maintaining this number. 

Although the purpose of the first study in this laboratory was to test the 
feasibility of introducing new procedures into an existing air force logistics 
system and to compare the modified system with the original one, the 
second laboratory problem had quite a different objective. Its purpose was 
to improve the design of the operational control system through the use of 
simulation. 

Naylor et al. [28] describe many situations where simulation can be
successfully used. We mention some of them. 

First, it may be either impossible or extremely expensive to obtain data 
from certain processes in the real world. Such processes might involve, for 
example, the performance of large-scale rocket engines, the effect of 
proposed tax cuts on the economy, the effect of an advertising campaign 
on total sales. In this case we say that the simulated data are necessary to 
formulate hypotheses about the system. 

Secondly, the observed system may be so complex that it cannot be 
described in terms of a set of mathematical equations for which analytic 
solutions are obtainable. Most economic systems fall into this category. 
For example, it is virtually impossible to describe the operation of a 
business firm, an industry, or an economy in terms of a few simple 
equations. Simulation has been found to be an extremely effective tool for 
dealing with problems of this type. Another class of problems that leads to 
similar difficulties is that of large-scale queueing problems involving multi- 
ple channels that are either parallel or in series (or both). 

Thirdly, even though a mathematical model can be formulated to 
describe some system of interest, it may not be possible to obtain a 
solution to the model by straightforward analytic techniques. Again, eco- 
nomic systems and complex queueing problems provide examples of this 
type of difficulty. Although it may be conceptually possible to use a set of 
mathematical equations to describe the behavior of a dynamic system 
operating under conditions of uncertainty, present-day mathematics and
computer technology are simply incapable of handling a problem of this 
magnitude. 

Fourth, it may be either impossible or very costly to perform validating 
experiments on the mathematical models describing the system. In this 
case we say that the simulation data can be used to test alternative 
hypotheses. 

In all these cases simulation is the only practical tool for obtaining 
relevant answers. 

Naylor et al. [28] have suggested that simulation analysis might be
appropriate for the following reasons: 

1 Simulation makes it possible to study and experiment with the 
complex internal interactions of a given system whether it be a firm, an 
industry, an economy, or some subsystem of one of these. 

2 Through simulation we can study the effects of certain informa- 
tional, organizational, and environmental changes on the operation of a 
system by making alterations in the model of the system and observing the 
effects of these alterations on the system’s behavior. 

3 Detailed observation of the system being simulated may lead to a 
better understanding of the system and to suggestions for improving it, 
suggestions that otherwise would not be apparent. 

4 Simulation can be used as a pedagogical device for teaching both 
students and practitioners basic skills in theoretical analysis, statistical 
analysis, and decision making. Among the disciplines in which simulation 
has been used successfully for this purpose are business administration, 
economics, medicine, and law. 

5 Operational gaming has been found to be an excellent means of 
stimulating interest and understanding on the part of the participant, and 
is particularly useful in the orientation of persons who are experienced in 
the subject of the game. 

6 The experience of designing a computer simulation model may be 
more valuable than the actual simulation itself. The knowledge obtained in 
designing a simulation study frequently suggests changes in the system 
being simulated. The effects of these changes can then be tested via 
simulation before implementing them on the actual system. 

7 Simulation of complex systems can yield valuable insight into which 
variables are more important than others in the system and how these 
variables interact. 

8 Simulation can be used to experiment with new situations about 
which we have little or no information so as to prepare for what may 
happen. 
9 Simulation can serve as a “preservice test” to try out new policies
and decision rules for operating a system, before running the risk of 
experimenting on the real system. 

10 Simulations are sometimes valuable in that they afford a convenient 
way of breaking down a complicated system into subsystems, each of 
which may then be modeled by an analyst or team that is expert in that 
area [23, p. 373]. 

11 Simulation makes it possible to study dynamic systems in either real 
time, compressed time, or expanded time. 

12 When new components are introduced into a system, simulation 
can be used to help foresee bottlenecks and other problems that may arise 
in the operation of the system [23, p. 375].

Computer simulation also enables us to replicate an experiment. Replica- 
tion means rerunning an experiment with selected changes in parameters 
or operating conditions being made by the investigator. In addition,
computer simulation often allows us to induce correlation between the
random number sequences used in different runs to improve the statistical analysis of the output
of a simulation. In particular a negative correlation is desirable when the 
results of two replications are to be summed, whereas a positive correlation 
is preferred when the results are to be differenced, as in the comparison of 
experiments. 
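
The effect of induced correlation can be sketched with a small experiment. The response function `f`, the factor 0.9 that distinguishes the two alternatives, and the sample sizes are assumptions made for illustration. Negative correlation is induced by reusing 1 - u in the second replication (antithetic variates), and positive correlation by reusing the same u (often called common random numbers).

```python
import random
import statistics


def f(u):
    """Illustrative response: a monotone function of one uniform input."""
    return u * u


def pair_averages(n, antithetic, rng):
    """Average two replications; the second reuses 1 - u when antithetic."""
    averages = []
    for _ in range(n):
        u = rng.random()
        v = 1.0 - u if antithetic else rng.random()
        averages.append(0.5 * (f(u) + f(v)))
    return averages


def pair_differences(n, common, rng):
    """Difference two alternatives; both reuse the same u when common."""
    differences = []
    for _ in range(n):
        u = rng.random()
        v = u if common else rng.random()
        differences.append(f(u) - 0.9 * f(v))
    return differences


rng = random.Random(12345)  # fixed seed so the experiment can be replicated
var_sum_independent = statistics.variance(pair_averages(20000, False, rng))
var_sum_antithetic = statistics.variance(pair_averages(20000, True, rng))
var_diff_independent = statistics.variance(pair_differences(20000, False, rng))
var_diff_common = statistics.variance(pair_differences(20000, True, rng))

print(var_sum_independent, var_sum_antithetic)
print(var_diff_independent, var_diff_common)
```

In this sketch the variance of the averaged pair is smaller under negative correlation, and the variance of the difference is smaller under positive correlation, in line with the remark above.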

Simulation does not require that a model be presented in a particular 
format. It permits a considerable degree of freedom so that a model can 
bear a close correspondence to the system being studied. The results
obtained from simulation are much the same as observations or measure- 
ments that might have been made on the system itself. To demonstrate 
the principles involved in executing a discrete simulation, an example of 
simulating a machine shop is given in Section 1.4. Many programming 
systems have been developed, incorporating simulation languages. Some of 
them are general-purpose in nature, while others are designed for specific 
types of systems. FORTRAN, ALGOL, and PL/1 are examples of 
general-purpose languages, while GPSS, SIMSCRIPT, and SIMULA are 
examples of special simulation languages. 

Simulation is indeed an invaluable and very versatile tool in those 
problems where analytic techniques are inadequate. However, it is by no 
means ideal. Simulation is an imprecise technique. It provides only statisti- 
cal estimates rather than exact results, and it only compares alternatives 
rather than generating the optimal one. Simulation is also a slow and costly 
way to study a problem. It usually requires a large amount of time and 
great expense for analysis and programming. Finally, simulation yields 
only numerical data about the performance of the system, and sensitivity 
analysis of the model parameters is very expensive. The only possibility is
to conduct series of simulation runs with different parameter values. 

We have defined simulation as a technique of performing sampling 
experiments on the model of the system. This general definition is often 
called simulation in a wide sense , whereas simulation in a narrow sense , or 
stochastic simulation , is defined as experimenting with the model over time; 
it includes sampling stochastic variates from probability distributions [19].
Therefore stochastic simulation is actually a statistical sampling experi- 
ment with the model. This sampling involves all the problems of statistical 
design analysis. 

Because sampling from a particular distribution involves the use of 
random numbers, stochastic simulation is sometimes called Monte Carlo 
simulation . Historically, the Monte Carlo method was considered to be a 
technique, using random or pseudorandom numbers, for solution of a 
model. Random numbers are essentially independent random variables 
uniformly distributed over the unit interval [0, 1]. Actually, what are
available at computer centers are arithmetic codes for generating sequences
of pseudorandom digits, where each digit (0 through 9) occurs with
approximately equal probability (likelihood). Consequently, the sequences
can model successive rolls of a fair ten-sided die. Such codes are called
random number generators. Grouped together, these generated digits yield
pseudorandom numbers with any required number of elements. We discuss 
random and pseudorandom numbers in the next chapter. 
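
The grouping of digits into numbers can be sketched as follows. Python's `random.Random` is used here only as a stand-in for a digit-producing generator, and the helper name `digits_to_uniform` and the group length of five digits are assumptions for illustration.

```python
import random


def digits_to_uniform(digits):
    """Read decimal digits d1 d2 ... dk as the number 0.d1d2...dk in [0, 1)."""
    value = 0.0
    for position, digit in enumerate(digits, start=1):
        value += digit / 10 ** position
    return value


rng = random.Random(2024)  # stand-in for a pseudorandom digit generator
digit_stream = [rng.randrange(10) for _ in range(5000)]

# Group the digits five at a time to obtain 1000 five-digit numbers.
numbers = [digits_to_uniform(digit_stream[i:i + 5]) for i in range(0, 5000, 5)]

print(numbers[:3])
```

Longer groups give numbers with more decimal places. The quality of the resulting numbers depends entirely on the quality of the underlying digit sequence.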

One of the earliest problems connected with the Monte Carlo method is the
famous Buffon’s needle problem. The problem is as follows. A needle of
length l units is thrown randomly onto a floor composed of parallel planks
of equal width d units, where d > l. What is the probability that the needle,
once it comes to rest, will cross (or touch) a crack separating the planks on
the floor? It can be shown that the probability of the needle hitting a crack
is $P = 2l/(\pi d)$, which can be estimated as the ratio of the number of throws
hitting the crack to the total number of throws.
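
The estimate described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the seed, and the values l = 1, d = 2 are hypothetical, and the throw is modeled by the distance from the needle's centre to the nearest crack together with the acute angle between the needle and the cracks.

```python
import math
import random

def buffon_crossing_fraction(l, d, throws, rng):
    """Fraction of throws in which a needle of length l crosses a crack (l < d)."""
    hits = 0
    for _ in range(throws):
        # Distance from the needle's centre to the nearest crack, uniform on [0, d/2].
        y = rng.uniform(0.0, d / 2.0)
        # Acute angle between the needle and the cracks, uniform on [0, pi/2].
        theta = rng.uniform(0.0, math.pi / 2.0)
        # The needle reaches the crack when its half-projection covers the distance.
        if y <= (l / 2.0) * math.sin(theta):
            hits += 1
    return hits / throws

rng = random.Random(12345)            # fixed seed so that the run is reproducible
l, d = 1.0, 2.0                       # illustrative values with d > l
estimate = buffon_crossing_fraction(l, d, 200_000, rng)
exact = 2.0 * l / (math.pi * d)       # P = 2l / (pi d)
print(f"estimate = {estimate:.4f}, exact = {exact:.4f}")
```

With a large number of throws the ratio should settle near 2l/(πd); the agreement improves only slowly, roughly as the inverse square root of the number of throws.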

In the beginning of the century the Monte Carlo method was used to
examine the Boltzmann equation. In 1908 the famous statistician Student
used the Monte Carlo method for estimating the correlation coefficient in
his t-distribution.

The term “Monte Carlo” was introduced by von Neumann and Ulam 
during World War II, as a code word for the secret work at Los Alamos; it 
was suggested by the gambling casinos at the city of Monte Carlo in 
Monaco. The Monte Carlo method was then applied to problems related 
to the atomic bomb. The work involved direct simulation of behavior 
concerned with random neutron diffusion in fissionable material. Shortly 
thereafter Monte Carlo methods were used to evaluate complex multidimensional
integrals and to solve certain integral equations, occurring in
physics, that were not amenable to analytic solution. 

The Monte Carlo method can be used not only for solution of stochastic 
problems, but also for solution of deterministic problems. A deterministic 
problem can be solved by the Monte Carlo method if it has the same 
formal expression as some stochastic process. In Chapter 4 we show how 
the Monte Carlo method can be used for evaluating multidimensional 
integrals and some parameters of queues and networks. In Chapter 5 the 
Monte Carlo method is used for solution of certain integral and differen- 
tial equations. 

Another field of application of the Monte Carlo methods is sampling of 
random variates from probability distributions, which Morgenthaler [23]
calls model sampling. Chapter 3 deals with sampling from various distribu- 
tions. 

The Monte Carlo method is now the most powerful and commonly used 
technique for analyzing complex problems. Applications can be found in 
many fields from radiation transport to river basin modeling. Recently, the 
range of applications has been broadening, and the complexity and com- 
putational effort required has been increasing, because realism is associ- 
ated with more complex and extensive problem descriptions. 

Finally, we mention some differences between the Monte Carlo method 
and simulation: 

1 In the Monte Carlo method time does not play as substantial a role 
as it does in stochastic simulation. 

2 The observations in the Monte Carlo method, as a rule, are indepen- 
dent. In simulation, however, we experiment with the model over time so, 
as a rule, the observations are serially correlated. 

3 In the Monte Carlo method it is possible to express the response as a 
rather simple function of the stochastic input variates. In simulation the 
response is usually a very complicated one and can be expressed explicitly 
only by the computer program itself. 

1.4 A MACHINE SHOP EXAMPLE 

This example is quoted from Gordon [11, pp. 570-573]. For better under- 
standing of the example an important distinction to be made is whether an 
entity is permanent or temporary. Permanent entities can be compactly 
and efficiently represented in tables, while temporary entities will be 
volatile records and are usually handled by the list processing technique 
described later.

Consider a simple machine shop (or a single stage in the manufacturing
process of a more complex machine shop). The shop is to machine five
types of parts. The parts arrive at random intervals and are distributed
randomly among the different types. There are three machines, all equally 
able to machine any part. If a machine is available at the time a part 
arrives, machining begins immediately. If all machines are busy upon 
arrival, the part will wait for service. On completion of machining the part 
will be dispatched to a certain destination, depending on its type. The 
progress of the part is not followed after it is dispatched from the shop. 
However, a count of the number of parts dispatched to each destination is 
kept. 

Clearly, there are two types of elements in the system: parts and 
machines. There will be a stream of temporary elements, that is, the parts 
that enter and leave the system. There is no point in representing the 
different types of parts as different elements; rather, the type is an 
attribute of the parts. As indicated before, it is simpler to consider the 
group of machines as a single permanent element, having as attributes the 
number of machines and a count of the number currently busy. The 
activities causing changes in the system are the generation of parts, 
waiting, machining, and departing. 


(a) System Image. A set of numbers is needed to record the state of the
system at any time. This set of numbers is called the system image, since it
reflects the state of the system. The simulation proceeds by deciding, from 
the system image, when the next event is due to occur and what type of 
event it will be; testing whether it can be executed; and executing the 
changes to the image implied by the event. 

The image must have a number representing clock time, and this 
number is advanced, in uneven steps, with the succession of events in the 
system. For each part record, there are four numbers to represent the part 
type, the arrival time, the machining time, and the time the part will next 
be involved in an event. The first three of these items are random variates 
derived by the methods described in Chapters 3 and 4. The next event 
time, in general, depends on the state of the system, and must be derived 
as the simulation proceeds. 

The organization used for the system image is illustrated in Fig. 1.4.1. 
There are four frames in this figure, representing successive states of the 
system. The frames are read from left to right and from top to bottom. The 
frame in the top left corner is the initial state. The description of the 
system image is made in terms of that particular frame. 

The top line of the system image represents the part due to enter the
system next. As shown here, it is a type 2 part, will require 75 minutes of 
machining, and is due to arrive at time 1002. This, of course, is also its next 
event time. 

Below the next arrival listing is an open-ended list of the parts that have 
arrived and are now waiting for service. Currently, there are two waiting 
parts. As indicated, they are listed in order of arrival. Because the waiting 
parts are delayed, it is not possible to predict a next event time for them. It 
is necessary to see whether there is a waiting part when a machine finishes, 
and to offer service to the first part in the waiting line. 

The next rows of numbers represent the parts now being machined, in 
this case limited to three. Once machining begins, the time to finish can be 
derived and entered as the next event time. Three parts are occupying the 
machines at this time and they have been listed in the order in which they 
will finish. Finally, a number represents the clock time, here set to an 
initial value of 1000, and there are five counters showing how many parts 
of each type have been completed. Note that it is not customary to 
precalculate all the random variates. Instead, each is calculated at the time 
it is needed, so a simulation program continually switches between the 
examination and manipulation of the system image and the subroutines 
that calculate the random variates. 


(b) The Simulation Process. Looking now at the system image in Fig.
1.4.1, assume all events that can be executed up to time 1000 have been 
processed. It is now time to begin one more cycle. The first step is to find 
the next potential event by scanning all the event times. Because of the 
ordering of the parts being machined, it is, in fact, necessary only to 
compare the time of the next arrival with the first listed time in the 
machining section. With the numbers shown in frame 1, the next event is 
the arrival of a part at time 1002, so the clock is updated to this time in the 
second frame. 

The arriving part finds all machines busy and must join the waiting line. 
The successor to the part just arrived is generated and inserted as the next 
future arrival, due to arrive at time 1018. Another cycle can now begin. 
The next event is the completion of machining a part at time 1003. The 
third frame of Fig. 1.4.1 shows the state of the system at the end of this 
event. The clock is updated to 1003 and the finished part is removed from 
the system, after incrementing by 1 the counter for that part type. There is 
a waiting part, so machining is started on the first part in the waiting line, 
and its next event time, derived from the machining time of 84, is 
calculated as 1087. In this case the new part for machining has the largest 
finish time, and it joins the end of the list of parts being machined. The
records in the waiting line and the machine segment are all moved down one
line. There is then another completion at 1017 that, as before, leads to a
counter being
incremented and service being offered to the first part in the waiting line. 
In this case, however, the machining time is short enough for the new part 
to finish ahead of one whose machining started earlier, so, instead of being 
the last listed part, the new part becomes the second in the list. This is 
shown in the last frame of Fig. 1.4.1. 
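
The cycle just described (find the next event, advance the clock, execute the change to the image) can be sketched as follows. This is a minimal illustration, not Gordon's program: the function and field names are hypothetical, the interarrival times are assumed exponential with a mean of 20 minutes, the machining times are assumed uniform between 30 and 90 minutes, and a heap stands in for the list of machined parts ordered by finish time.

```python
import heapq
import random
from collections import deque

def simulate_shop(n_arrivals, n_machines=3, n_types=5, seed=1):
    """Next-event simulation of the shop; returns (final clock, counts by type)."""
    rng = random.Random(seed)

    def new_part(now):
        # Random variates are generated only when they are needed.
        return {
            "type": rng.randint(1, n_types),
            "arrival": now + rng.expovariate(1.0 / 20.0),  # assumed mean gap: 20 min
            "machining": rng.uniform(30.0, 90.0),          # assumed machining time
        }

    clock = 0.0
    next_arrival = new_part(clock)   # the part due to enter the system next
    waiting = deque()                # waiting parts, in order of arrival
    machining = []                   # heap of (finish time, sequence number, part)
    completed = [0] * (n_types + 1)  # counters by part type (index 0 unused)
    arrived = 0
    seq = 0

    def start(part):
        nonlocal seq
        seq += 1                     # tie-breaker so that parts are never compared
        heapq.heappush(machining, (clock + part["machining"], seq, part))

    while next_arrival is not None or machining:
        arrival_time = next_arrival["arrival"] if next_arrival else float("inf")
        finish_time = machining[0][0] if machining else float("inf")
        if arrival_time <= finish_time:
            # Next event is an arrival: advance the clock, generate the successor.
            clock = arrival_time
            part = next_arrival
            arrived += 1
            next_arrival = new_part(clock) if arrived < n_arrivals else None
            if len(machining) < n_machines:
                start(part)
            else:
                waiting.append(part)
        else:
            # Next event is a completion: count the part, serve the waiting line.
            clock, _, part = heapq.heappop(machining)
            completed[part["type"]] += 1
            if waiting:
                start(waiting.popleft())
    return clock, completed[1:]

final_clock, counts = simulate_shop(200)
print(final_clock, counts)
```

Only the two candidate event times are compared in each cycle, which mirrors the remark that the ordering of the machined parts makes a full scan unnecessary.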

(c) Statistics Gathering. The purpose of the simulation, of course, is to
learn something about the system. In this case only the counts of the 
number of completed parts by type have been kept. Depending upon the 
purpose of the simulation study, other statistics could be gathered. Simula- 
tion language programs include routines for collecting certain typical 
statistics. Among the commonly used types of statistics are the following:

1 Counts. Counts give the number of elements of a given type or the
number of times some event occurred.

2 Utilization of equipment. This can be counted in terms of the
fraction of time the equipment is in use or in terms of the average number
of units in use.

3 Distributions. This means distributions of random variates, such as
processing times and response times, together with their means and
standard deviations.

(d) List Processing. In the machine shop example it was convenient to
describe the records as though they were located in one of three places, 
corresponding to whether they represented parts that were arriving, wait- 
ing, or being processed. The simulation was described in terms of moving 
the records from one place to the next, possibly with some resorting. A 
computer program that used this approach would be very inefficient 
because of the large amount of data movement involved. Much better 
control and efficiency are obtained by using list processing. With this 
technique each record consists of a number of contiguous words (or bytes), 
some of which are reserved for constructing a list of the records. Each 
record contains, in a standard position, the address of the next record in 
the list. This is called a pointer. A special word, called a header, located in 
a known position, contains a pointer to the first record in the list. The last 
record in the list has an end-of-list symbol in place of its pointer. If the list 
happens to be empty, the end-of-list symbol appears in the header. 

The pointers, beginning from the header, place the records in a specific 
order, and allow a program to search the records by following the chain of 
pointers. These lists, in fact, are usually called chains. There may be
another set of pointers tracing through the chain from end to beginning so 
that a program can move along the chain in either direction. It is also 
possible for a record to be on more than one chain, simply by reserving 
pointer space for each possible chain. 

Removing or adding a record, or reorganizing the order of a chain now 
becomes a matter of manipulating pointers. To remove C from a chain of 
the records A, B, C, D, ..., the pointer of B is redirected to D. If the record
is being discarded, its storage space would probably be returned to another 
chain from which it can be reassigned later. To put the record Z between B 
and C, the pointer of B is directed to Z and the pointer of Z is set to 
indicate C. Reordering a chain consists of a series of removals and 
insertions. 
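
The pointer manipulations above can be sketched with a singly linked chain. The class and method names below are hypothetical, and Python object references stand in for the machine addresses that the text calls pointers; `None` plays the role of the end-of-list symbol.

```python
class Record:
    def __init__(self, name):
        self.name = name
        self.next = None          # pointer to the next record; None marks end of list

class Chain:
    def __init__(self):
        self.header = None        # header points to the first record (None if empty)

    def insert_after(self, prev, record):
        """Link record in after prev, or at the front when prev is None."""
        if prev is None:
            record.next = self.header
            self.header = record
        else:
            record.next = prev.next
            prev.next = record

    def remove_after(self, prev):
        """Unlink and return the record following prev (assumes one exists)."""
        removed = prev.next
        prev.next = removed.next  # redirect the pointer past the removed record
        removed.next = None
        return removed

    def find(self, name):
        rec = self.header
        while rec is not None and rec.name != name:
            rec = rec.next
        return rec

    def names(self):
        out, rec = [], self.header
        while rec is not None:
            out.append(rec.name)
            rec = rec.next
        return out

def build(names):
    chain, prev = Chain(), None
    for n in names:
        rec = Record(n)
        chain.insert_after(prev, rec)
        prev = rec
    return chain

removal = build("ABCD")
removal.remove_after(removal.find("B"))               # pointer of B now indicates D
insertion = build("ABCD")
insertion.insert_after(insertion.find("B"), Record("Z"))  # Z goes between B and C
print(removal.names(), insertion.names())
```

No record is copied or moved in either operation; only one or two pointers change, which is the efficiency argument made in the text.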

As can be seen, list processing does not require that records be physi- 
cally moved. It therefore provides an efficient way of transferring records 
from one category to another by moving them on and off chains, and it 
can easily manage lists that are constantly changing size; these are two 
properties that are very desirable in simulation programming. Therefore 
list processing is used in the implementation of all major discrete system 
simulation languages, including the GPSS and SIMSCRIPT simulation 
programs. 
CHAPTER 2

Random Number 
Generation 


2.1 INTRODUCTION 

In this chapter we are concerned with methods of generating random 
numbers on digital computers. The importance of the random numbers in 
the Monte Carlo method and simulation has been discussed in Chapter 1. 
The emphasis in this chapter is mainly on the properties of numbers 
associated with uniform random variates. The term random number is used
instead of uniform random number. Many techniques for generating random
numbers have been suggested, tested, and used in recent years. Some
of these are based on random phenomena, others on deterministic recur- 
rence procedures. 

Initially, manual methods were used, including such techniques as coin 
flipping, dice rolling, card shuffling, and roulette wheels. It was believed 
that only mechanical (or electronic) devices could yield “truly” random 
numbers. These methods were too slow for general use, and moreover,
sequences generated by them could not be reproduced. Shortly following 
the advent of the computer it became possible to obtain random numbers 
with its aid. One method of generating random numbers on a digital 
computer consists of preparing a table and storing it in the memory of the 
computer. In 1955 the RAND Corporation published [46] a well known 
table of a million random digits that may be used in forming such a table. 
The advantage of this method is reproducibility; its disadvantage is its lack 
of speed and the risk of exhausting the table. 

In view of these difficulties, John von Neumann [56] suggested the 
mid-square method, using the arithmetic operations of a computer. His idea 
was to take the square of the preceding random number and extract the 
middle digits; for example, if we are generating four-digit numbers and
arrive at 5232, we square it, obtain 27,373,824; the next number consists of 
the middle four digits — namely, 3738 — and the procedure is repeated. 
This raises a logical question: how can such sequences, defined in a 
completely deterministic way, be random? The answer is that they are not 
really random, but only seem so, and are in fact referred to as pseudorandom
or quasi-random; still we call them random, with the appropriate
reservation. Von Neumann’s method likewise proved slow and awkward
for statistical analysis; in addition the sequences tend to cyclicity, and 
once a zero is encountered the sequence terminates. 
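
The mid-square step can be sketched as follows; the function names are hypothetical, and the square is padded with leading zeros to twice the working width before the middle digits are extracted.

```python
def mid_square_next(x, digits=4):
    """Square x and return the middle `digits` digits of the 2*digits-digit square."""
    squared = str(x * x).zfill(2 * digits)   # pad with leading zeros to full width
    start = digits // 2
    return int(squared[start:start + digits])

def mid_square_sequence(seed, count, digits=4):
    """Return the next `count` numbers after `seed`."""
    values, x = [], seed
    for _ in range(count):
        x = mid_square_next(x, digits)
        values.append(x)
    return values

print(mid_square_next(5232))        # 5232**2 = 27,373,824, middle digits 3738
print(mid_square_sequence(0, 3))    # a zero reproduces itself indefinitely
```

The second call illustrates the degeneracy noted above: once zero appears, every later term is zero.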

We say that the random numbers generated by this or any other method 
are “good” ones if they are uniformly distributed, statistically independent, 
and reproducible. A good method is, moreover, necessarily fast and 
requires minimum memory capacity. Since all these properties are rarely, if 
ever, realized, some compromise must be found. The congruential methods 
for generating pseudorandom numbers, discussed in the next section, were 
designed specifically to satisfy as many of these requirements as possible. 


2.2 CONGRUENTIAL GENERATORS

The most commonly used present-day method for generating pseudoran- 
dom numbers is one that produces a nonrandom sequence of numbers 
according to some recursive formula based on calculating the residues
modulo some integer m of a linear transformation. It is readily seen
from this definition that each term of the sequence is available in advance, 
before the sequence is actually generated. Although these processes are 
completely deterministic, it can be shown [31] that the numbers generated 
by the sequence appear to be uniformly distributed and statistically inde- 
pendent. Congruential methods are based on a fundamental congruence 
relationship, which may be expressed as [32] 

$$X_{i+1} = (aX_i + c) \pmod{m}, \qquad i = 1, \ldots, n, \tag{2.2.1}$$

where the multiplier a, the increment c, and the modulus m are nonnegative
integers. The modulo notation (mod m) means that

$$X_{i+1} = aX_i + c - mk_i, \tag{2.2.2}$$

where $k_i = [(aX_i + c)/m]$ denotes the largest integer not exceeding
$(aX_i + c)/m$.

Given an initial starting value $X_0$ (also called the seed), (2.2.2) yields a
congruence relationship (modulo m) for any value i of the sequence $\{X_i\}$.
Generators that produce random numbers according to (2.2.1) are called
mixed congruential generators. The random numbers on the unit interval
(0, 1) can be obtained by

$$U_i = \frac{X_i}{m}. \tag{2.2.3}$$

Clearly, such a sequence will repeat itself in at most m steps, and will
therefore be periodic. For example, let $a = c = X_0 = 3$ and $m = 5$; then the
sequence obtained from the recursive formula $X_{i+1} = 3X_i + 3 \pmod{5}$ is
$X_i = 3, 2, 4, 0, 3$.

It follows from (2.2.2) that $X_i < m$ for all i. This inequality means that
the period of the generator cannot exceed m, that is, the sequence $X_i$
contains at most m distinct numbers (the period of the generator in the
example is 4, while m = 5).
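
The example can be reproduced with a short sketch of the mixed congruential recursion. The function names are hypothetical; the period is found by iterating until a value recurs.

```python
from itertools import islice

def lcg(seed, a, c, m):
    """Yield X_1, X_2, ... from X_{i+1} = (a*X_i + c) mod m."""
    x = seed
    while True:
        x = (a * x + c) % m
        yield x

def period(seed, a, c, m):
    """Length of the cycle that the sequence started at `seed` eventually enters."""
    seen = {}
    x, step = seed, 0
    while x not in seen:
        seen[x] = step
        x = (a * x + c) % m
        step += 1
    return step - seen[x]

first = [3] + list(islice(lcg(3, 3, 3, 5), 4))   # X_0 followed by X_1, ..., X_4
uniforms = [x / 5 for x in first]                # U_i = X_i / m
print(first, period(3, 3, 3, 5))
```

The printed sequence is 3, 2, 4, 0, 3 and the period is 4: the value 1 is never visited, so this choice of a, c, and m falls short of the maximum possible period m = 5.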

Because of the deterministic character of the sequence, the entire se- 
quence recurs as soon as any number is repeated. We say that the sequence 
“gets into a loop,” that is, there is a cycle of numbers that is repeated 
endlessly. It is shown [31] that all sequences having the form $X_{i+1} = f(X_i)$
“get into a loop.” We want, of course, to choose m as large as possible to 
ensure a sufficiently large sequence of distinct numbers in a cycle. 

Let p be the period of the sequence. When p equals its maximum, that is,
when $p = m$, we say that the random number generator has a full period. It
can be shown [31] that the generator defined in (2.2.1) has a full period, m,
if and only if:

1 c is relatively prime to m, that is, c and m have no common divisor
other than 1.

2 $a \equiv 1 \pmod{g}$ for every prime factor g of m.

3 $a \equiv 1 \pmod{4}$ if m is a multiple of 4.

Condition 1 means that the greatest common divisor of c and m is unity.
Condition 2 means that $a = g[a/g] + 1$. Let g be a prime factor of m; then
denoting $k = [a/g]$, we may write

$$a = 1 + gk. \tag{2.2.4}$$

Condition 3 means that

$$a = 1 + 4[a/4] \tag{2.2.5}$$

if m/4 is an integer.
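
The three conditions are easy to check mechanically. The sketch below uses hypothetical function names and compares the conditions with a brute-force measurement of the period for small moduli; it is meant as an illustration, not as a proof.

```python
from math import gcd

def prime_factors(n):
    """Set of prime factors of n, found by trial division."""
    factors, p = set(), 2
    while p * p <= n:
        while n % p == 0:
            factors.add(p)
            n //= p
        p += 1
    if n > 1:
        factors.add(n)
    return factors

def has_full_period(a, c, m):
    """Check conditions 1-3 for X_{i+1} = (a*X_i + c) mod m."""
    cond1 = gcd(c, m) == 1                                   # c relatively prime to m
    cond2 = all((a - 1) % g == 0 for g in prime_factors(m))  # a = 1 (mod g)
    cond3 = (m % 4 != 0) or ((a - 1) % 4 == 0)               # a = 1 (mod 4) if 4 | m
    return cond1 and cond2 and cond3

def brute_force_period(a, c, m, seed=0):
    """Cycle length reached from `seed`, found by iterating until a repeat."""
    seen = {}
    x, step = seed, 0
    while x not in seen:
        seen[x] = step
        x = (a * x + c) % m
        step += 1
    return step - seen[x]

print(has_full_period(5, 1, 16), brute_force_period(5, 1, 16))
print(has_full_period(3, 3, 5), brute_force_period(3, 3, 5, seed=3))
```

For m = 16, a = 5, c = 1 all three conditions hold and the measured period is 16; for the earlier example (a = c = 3, m = 5) condition 2 fails and the measured period is 4.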

Greenberger [19] showed that the correlation coefficient between $X_i$ and
$X_{i+1}$ lies between the values



and that its upper bound is achieved when $a = m^{1/2}$ irrespective of the
value of c. 
Since most computers utilize either a binary or a decimal digit system,
we select $m = 2^\beta$ or $m = 10^\beta$, respectively, where $\beta$ denotes the
word-length of the particular computer. We discuss both cases separately in
the following.

For a binary computer $m = 2^\beta$. It follows from the conditions above that,
to obtain a full period with $m = 2^\beta$, the parameter c must be odd and

$$a \equiv 1 \pmod{4}, \tag{2.2.6}$$

which can be achieved by setting

$$a = 2^r + 1, \qquad r \ge 2.$$

It is noted in the literature [25, 35, 44] that good statistical results can be 
achieved while choosing $m = 2^{35}$, $a = 2^7 + 1$, and $c = 1$.

For a decimal computer $m = 10^\beta$. In order to generate a sequence with a
full period, c must be a positive number not divisible by $g = 2$ or $g = 5$, and
the multiplier a must satisfy the condition $a \equiv 1 \pmod{20}$, or
alternatively,

$$a = 10^r + 1, \qquad r > 1.$$

Satisfactory statistical results have been achieved [1] by choosing $a = 101$,
$c = 1$, $r \ge 4$. In this case $X_0$ had little or no effect on the statistical
properties of the generated sequences.

The second widely used generator is the multiplicative generator 

$$X_{i+1} = aX_i \pmod{m}, \tag{2.2.7}$$

which is a particular case of the mixed generator (2.2.1) with $c = 0$.

It can be shown [1, 2, 5, 31] that, generally, a full period cannot be 
achieved here, but a maximal period can, provided that $X_0$ is relatively
prime to m and a meets certain congruence conditions. 

For a binary computer we again choose $m = 2^\beta$ and it is shown [31] that
the maximal period is achieved when $a = 8r \pm 3$. Here r is any positive
integer.

The procedure for generating pseudorandom numbers on a binary
computer* can be written as:

1 Choose any odd number as a starting value $X_0$.

2 Choose an integer $a = 8r \pm 3$, where r is any positive integer.
Choose a close to $2^{\beta/2}$ (if $\beta = 35$, $a = 2^{17} + 3$ is a good selection).

3 Compute $aX_0$ using fixed point integer arithmetic. This product will
consist of $2\beta$ bits from which the high-order $\beta$ bits are discarded, and the
low-order $\beta$ bits represent $X_1$.

4 Calculate $U_1 = X_1/2^\beta$ to obtain a uniformly distributed variable.

5 Each successive random number $X_{i+1}$ is obtained from the low-order
bits of the product $aX_i$.

*This procedure and the one that follows are reproduced almost verbatim from Ref. 31.
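
The binary procedure can be sketched as below. The function name and the seed are hypothetical; a bit mask stands in for the hardware's discarding of the high-order bits, and Python's unbounded integers keep the product exact.

```python
from itertools import islice

def multiplicative_binary(seed, a, beta):
    """Yield (X_i, U_i) for X_{i+1} = a*X_i mod 2**beta, i = 0, 1, ..."""
    if seed % 2 == 0:
        raise ValueError("the starting value X_0 must be odd")
    if a % 8 not in (3, 5):
        raise ValueError("the multiplier must have the form 8r + 3 or 8r - 3")
    mask = (1 << beta) - 1
    x = seed
    while True:
        x = (a * x) & mask           # keep only the low-order beta bits
        yield x, x / (1 << beta)     # U_i = X_i / 2**beta

# Parameters suggested in the text: beta = 35, a = 2**17 + 3 (the seed is arbitrary).
gen = multiplicative_binary(seed=12345, a=2**17 + 3, beta=35)
first_five = [u for _, u in islice(gen, 5)]
print(first_five)

# Small case for inspection: beta = 8, a = 19 = 8*2 + 3, odd seed 1.
small = [x for x, _ in islice(multiplicative_binary(1, 19, 8), 64)]
print(len(set(small)), small[-1])
```

In the small case the 64 values are all distinct and the last one equals the seed, which matches the maximal period m/4 quoted for a multiplicative generator whose modulus is a power of 2.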

For a decimal computer $m = 10^\beta$. It is shown in Ref. 49 that the
maximal period is achieved when $a = 200r \pm p$, where r is any positive
integer and p is any of the following 16 numbers: (3, 11, 13, 19,
21, 27, 29, 37, 53, 59, 61, 67, 69, 77, 83, 91). The procedure for generating
random numbers on a decimal computer can be written as:

1 Choose any odd integer not divisible by 5 as a starting value $X_0$.

2 Choose an integer $a = 200r \pm p$ for a constant multiplier, where r is
any integer and p is any of the values 3, 11, 13, 19, 21, 27,
29, 37, 53, 59, 61, 67, 69, 77, 83, 91. Choose a close to $10^{\beta/2}$. (If
$\beta = 10$, $a = 100{,}000 \pm 3$ is a good selection.)

3 Compute $aX_0$ using fixed point integer arithmetic. This product will
consist of $2\beta$ digits, from which the high-order $\beta$ digits are discarded, and
the low-order digits are the value of $X_1$. Integer multiplication instructions
automatically discard the high-order $\beta$ digits.

4 The decimal point must be shifted $\beta$ digits to the left to convert the
random number (which is an integer) into a uniformly distributed variate
defined over the unit interval, $U_1 = X_1/10^\beta$.

5 Each successive random number $X_{i+1}$ is obtained from the low-order
digits of the product $aX_i$.

Another type of generator in which $X_{i+1}$ depends on more than one of
the preceding values is the additive congruential generator [17]

$$X_{i+1} = X_i + X_{i-k} \pmod{m}, \qquad k = 1, 2, \ldots, i - 1. \tag{2.2.8}$$

In the particular case $k = 1$ we obtain the well known Fibonacci sequence,
which behaves like sequences produced by the multiplicative congruential
method with $a = (1 + \sqrt{5})/2$. Unfortunately, a Fibonacci sequence is not
satisfactorily random, but its statistical properties improve as k increases.
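
A sketch of the additive recursion, assuming the form $X_{i+1} = X_i + X_{i-k}$ (mod m) with k + 1 starting values supplied by the caller; the function name and the starting values are hypothetical.

```python
def additive_congruential(seeds, k, m, count):
    """Generate `count` values of X_{i+1} = (X_i + X_{i-k}) mod m.

    `seeds` holds the k + 1 starting values X_0, ..., X_k.
    """
    if len(seeds) != k + 1:
        raise ValueError("need k + 1 starting values X_0, ..., X_k")
    xs = list(seeds)
    for _ in range(count):
        xs.append((xs[-1] + xs[-1 - k]) % m)   # X_i is xs[-1], X_{i-k} is xs[-1-k]
    return xs[k + 1:]

fib = additive_congruential([0, 1], k=1, m=1000, count=8)
lagged = additive_congruential([3, 1, 4, 1, 5], k=4, m=100, count=10)
print(fib)
print(lagged)
```

With k = 1 and starting values 0, 1 the output is the Fibonacci sequence reduced modulo m; larger k simply reaches further back into the history.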

Résumé: We have seen that a sequence of pseudorandom numbers
produced by a congruential generator is completely defined by the
numbers $X_0$, a, c, and m. In order to obtain satisfactory statistical results
our choice must be based on the following six principles*:

1 The number $X_0$ may be chosen arbitrarily. If the program is run
several times and a different source of random numbers is desired each
time, set $X_0$ equal to the last value attained by X on the preceding run,
or (if more convenient) set $X_0$ equal to the current date and time.

2 The number m should be large. It may conveniently be taken as
the computer’s word length, since this makes the computation of
$(aX + c) \pmod{m}$ quite efficient. The computation of $(aX + c) \pmod{m}$ must be
done exactly, with no roundoff error.

3 If m is a power of 2 (i.e., if a binary computer is being used), pick a
so that $a \bmod 8 = 5$. If m is a power of 10 (i.e., if a decimal computer is
being used), choose a so that $a \bmod 200 = 21$. This choice of a, together
with the choice of c given below, ensures that the random number
generator will produce all m different possible values of X before it starts
to repeat.

4 The multiplier a should be larger than $\sqrt{m}$, preferably larger than
m/100, but smaller than $m - \sqrt{m}$. The best policy is to take some
haphazard constant to be the multiplier, such as $a = 3{,}141{,}592{,}621$ (which
satisfies both of the conditions in 3).

5 The constant c should be an odd number when m is a power of 2
and, when m is a power of 10, should also not be a multiple of 5.

6 The least significant (right-hand) digits of X are not very random,
so decisions based on the number X should always be primarily
influenced by the most significant digits. It is generally better to think of X
as a random fraction X/m between 0 and 1, that is, to visualize X with a
decimal point at its left, than to regard X as a random integer between 0
and $m - 1$. To compute a random integer between 0 and $k - 1$, we
would multiply by k and truncate the result.

*These six principles are reproduced by permission from Knuth [31, pp. 155-156].
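
Principles 3 and 6 lend themselves to a quick check. In the sketch below the function name is hypothetical; the truncation of kX/m is done in integer arithmetic so that no roundoff enters.

```python
def random_integer(x, m, k):
    """Map X in [0, m) to an integer in [0, k) using the high-order part of X."""
    return (k * x) // m          # floor(k * X / m), computed exactly

# The haphazard multiplier quoted in principle 4 against the conditions in principle 3.
a = 3_141_592_621
multiplier_ok = (a % 8 == 5) and (a % 200 == 21)

m = 2**35
lowest = random_integer(0, m, 6)         # smallest possible X
highest = random_integer(m - 1, m, 6)    # largest possible X
print(multiplier_ok, lowest, highest)
```

Taking `x % k` instead would base the decision on the low-order digits of X, which principle 6 warns against.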

Finally, we present in this section the IBM System/360 Uniform Random
Number Generator, a multiplicative congruential generator that utilizes the
full word size, which is equal to 32 bits with 1 bit reserved for algebraic
sign. Therefore an obvious choice for m is $2^{31}$.

A pure congruential generator (c = 0) with $m = 2^k$ (k > 0) can have a
maximum period length of m/4. Thus the maximum period length is
$2^{31}/4 = 2^{29}$. The period length also depends on the starting value. When
the modulus m is prime, the maximum possible period length is m − 1. The
largest prime less than or equal to $2^{31}$ is $2^{31} - 1$. Hence, if we choose
$m = 2^{31} - 1$, the uniform random number generators will have a maximum
period length of $m - 1 = 2^{31} - 2$, which is only the upper bound on the
period length. The period length actually attained depends on the choice of
the multiplier. Note that the conditions ensuring a maximum period length do
not necessarily guarantee good statistical properties for the generator,
although the choice of the particular multiplier $7^5$ does satisfy some known
conditions regarding the statistical performance of the generated sequence.
The System/360 Generator can be described as follows. Choose any
$X_0 > 0$. For $n \ge 1$,

$$X_n = 7^5 X_{n-1} \pmod{2^{31} - 1} = 16{,}807\, X_{n-1} \pmod{2^{31} - 1}.$$

The random numbers are (see (2.2.3)) $U_n = X_n/(2^{31} - 1)$.
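
The recursion is short enough to sketch directly. The function names are hypothetical, and the seed check (0 < X_0 < 2^31 − 1) is an added assumption that keeps the starting value inside the range of the residues; Python integers make the product exact, whereas the original subroutines had to work within a 32-bit word.

```python
M = 2**31 - 1      # modulus, a prime
A = 7**5           # multiplier, 16,807

def system360_next(x):
    """One step of X_n = 7**5 * X_{n-1} mod (2**31 - 1)."""
    return (A * x) % M

def system360_uniforms(seed, count):
    """Return U_1, ..., U_count with U_n = X_n / (2**31 - 1)."""
    if not 0 < seed < M:
        raise ValueError("seed must satisfy 0 < X_0 < 2**31 - 1")
    x, out = seed, []
    for _ in range(count):
        x = system360_next(x)
        out.append(x / M)
    return out

print(system360_uniforms(seed=1, count=3))
```

Because the modulus is prime and the seed is nonzero, no term of the sequence is ever zero, so every U_n lies strictly between 0 and 1.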

The results of the statistical tests of the System/360 Uniform Random 
Number Generator indicate that it is very satisfactory. Versions of this 
generator are used in the IBM SL/MATH package, the IBM version of 
APL, the Naval Postgraduate School random number generator package 
LLRANDOM, and the International Mathematics and Statistics Library 
(IMSL) package. The generator is also used in the simulation programming 
language SIMPL/I. The assembly language subroutines GGL1 and GGL2 
of IBM Corporation (1974) also implement this generator, as well as the 
FORTRAN subroutine GGL. 


2.3 STATISTICAL TESTS OF PSEUDORANDOM NUMBERS

In this section we describe some statistical tests for checking independence 
and uniformity of a sequence of pseudorandom numbers produced by a 
computer program. As mentioned earlier, a sequence of pseudorandom 
numbers is completely deterministic, but insofar as it passes the set of 
statistical tests, it may be treated as one of “truly” random numbers, that
is, as a sample from $\mathcal{U}(0, 1)$. Our object in this section is to provide some
idea of these tests rather than present rigorous proofs. For a more detailed 
discussion of this topic the reader is referred to Fishman [11] and Knuth 
[31]. 

2.3.1 Chi-Square Goodness-of-Fit Test

The chi-square goodness-of-fit test, proposed by Pearson in 1900, is 
perhaps the best known of all statistical tests. 

Let $X_1, \ldots, X_N$ be a sample drawn from a population with unknown
cumulative distribution function (c.d.f.) $F_X(x)$. We wish to test the null
hypothesis

$$H_0: F_X(x) = F_0(x) \quad \text{for all } x,$$

where $F_0(x)$ is a completely specified c.d.f., against the alternative

$$H_1: F_X(x) \ne F_0(x) \quad \text{for some } x.$$

Assume that the N observations have been grouped into k mutually
exclusive categories, and denote by $N_j$ and $Np_j^0$ the observed number of
trial outcomes and the expected number for the jth category, $j = 1, \ldots, k$,
respectively, when $H_0$ is true.
The test criterion suggested by Pearson uses the following statistic:

$$Y = \sum_{j=1}^{k} \frac{(N_j - Np_j^0)^2}{Np_j^0}, \tag{2.3.1}$$

which tends to be small when $H_0$ is true and large when $H_0$ is false. The
exact distribution of the random variable Y is quite complicated, but for
large samples its distribution is approximately chi-square with k − 1
degrees of freedom [15].

Under the $H_0$ hypothesis we expect

$$P(Y > \chi^2_{1-\alpha}) = \alpha, \tag{2.3.2}$$

where $\alpha$ is the significance level, say 0.05 or 0.1; the quantile
$\chi^2_{1-\alpha}$ that corresponds to probability $1 - \alpha$ is given in the tables
of the chi-square distribution.

When testing for uniformity we simply divide the interval [0, 1] into k
nonoverlapping subintervals of length 1/k so that $Np_j^0 = N/k$. In this case
we have

$$Y = \frac{k}{N} \sum_{j=1}^{k} \left(N_j - \frac{N}{k}\right)^2,$$

and (2.3.2) can again be applied for testing random number generators.

To ensure the asymptotic properties of Y it is often recommended in
the literature to choose $N > 5k$ and $k > 1000$, where $k = 2^\beta$ and $k = 10^\beta$
for a binary and a decimal computer, respectively.
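
A sketch of the uniformity test follows. The function name is hypothetical; the example uses k = 10 cells only to keep the output readable (far fewer than the recommendation above), Python's built-in generator as the sequence under test, and the tabulated 0.95 quantile of the chi-square distribution with 9 degrees of freedom, 16.919.

```python
import random

def chi_square_uniformity(sample, k):
    """Return Y = (k/N) * sum_j (N_j - N/k)**2 for a sample from [0, 1]."""
    n = len(sample)
    counts = [0] * k
    for u in sample:
        counts[min(int(u * k), k - 1)] += 1   # u == 1.0 is placed in the last cell
    expected = n / k                          # N p_j = N/k under H_0
    return sum((nj - expected) ** 2 for nj in counts) / expected

CHI2_095_DF9 = 16.919        # tabulated quantile for k - 1 = 9 degrees of freedom

rng = random.Random(2024)
sample = [rng.random() for _ in range(1000)]
y = chi_square_uniformity(sample, k=10)
print(f"Y = {y:.3f}, reject H0 at the 0.05 level: {y > CHI2_095_DF9}")
```

A perfectly even spread gives Y = 0, while a sample that falls entirely into one cell gives Y = N(k − 1), the largest value the statistic can take.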


2.3.2 Kolmogorov-Smirnov Goodness-of-Fit Test


Another test well known in statistical literature is the one proposed by 
Kolmogorov and developed by Smirnov. 

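Let $X_1, \ldots, X_N$ again denote a random sample from unknown c.d.f.
$F_X(x)$. The sample cumulative distribution function, denoted by $F_N(x)$, is
defined as

$$F_N(x) = \frac{1}{N}(\text{number of } X_i \text{ less than or equal to } x) = \frac{1}{N} \sum_{i=1}^{N} I_{(-\infty, x]}(X_i),$$

where $I_{(-\infty, x]}(X)$ is the indicator random variable (r.v.), that is,

$$I_{(-\infty, x]}(X) = \begin{cases} 1, & -\infty < X \le x \\ 0, & \text{otherwise.} \end{cases} \tag{2.3.4}$$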


For fixed x, $F_N(x)$ is itself an r.v., since it is a function of the sample.
Let us show that $F_N(x)$ has the same distribution as the sample mean of
a Bernoulli distribution, namely

$$P\left[F_N(x) = \frac{k}{N}\right] = \binom{N}{k} [F_X(x)]^k [1 - F_X(x)]^{N-k}. \tag{2.3.5}$$

Denote $V_i = I_{(-\infty, x]}(X_i)$; then $V_i$ has a Bernoulli distribution with
parameter $P(V_i = 1) = P(X_i \le x) = F_X(x)$. Since $\sum_{i=1}^{N} V_i$ has a
binomial distribution with parameters N and $F_X(x)$, and since
$F_N(x) = (1/N) \sum_{i=1}^{N} V_i$, the result follows immediately.

From (2.3.5) we see that

$$E[F_N(x)] = F_X(x) \tag{2.3.6}$$

and

$$\operatorname{var} F_N(x) = \frac{1}{N} F_X(x)[1 - F_X(x)]. \tag{2.3.7}$$

Equations (2.3.6) and (2.3.7) show that, for fixed x, $F_N(x)$ is an unbiased
and consistent estimator of $F_X(x)$ irrespective of the form of $F_X(x)$. Since
$F_N(x)$ is the sample mean of random variables $V_i$, $i = 1, \ldots, N$, it
follows from the central-limit theorem that $F_N(x)$ is asymptotically normally
distributed with mean $F_X(x)$ and variance $(1/N)F_X(x)[1 - F_X(x)]$.
We are interested in estimating $F_X(x)$ for every x (rather than for a fixed x)
and in finding how close $F_N(x)$ is to $F_X(x)$ jointly over all values x.

The result

$$\lim_{N \to \infty} P\left\{ \sup_{-\infty < x < \infty} |F_N(x) - F_X(x)| > \varepsilon \right\} = 0 \tag{2.3.8}$$

is known as the Glivenko-Cantelli theorem, which states that for every
$\varepsilon > 0$ the step function $F_N(x)$ converges uniformly to the distribution
function $F_X(x)$. Therefore for large N the deviation $|F_N(x) - F_X(x)|$ between
the true function $F_X(x)$ and its statistical image $F_N(x)$ should be small for
all values of x.

The random quantity

$$D_N = \sup_{-\infty < x < \infty} |F_N(x) - F_X(x)|, \tag{2.3.9}$$

which measures how far $F_N(x)$ deviates from $F_X(x)$, is called the
Kolmogorov-Smirnov one-sample statistic. Kolmogorov and Smirnov proved
that, for any continuous distribution $F_X(x)$,

$$\lim_{N \to \infty} P\left(\sqrt{N}\,D_N \le x\right) = 1 - 2\sum_{j=1}^{\infty} (-1)^{j-1} \exp(-2 j^2 x^2) = H(x). \tag{2.3.10}$$

The function $H(x)$ has been tabulated and the approximation was found
to be sufficiently close for practical applications, so long as $N$ exceeds 35.

The c.d.f. $H(x)$ does not depend on the one from which the sample was
drawn; that is, the limiting distribution of $\sqrt{N}\,D_N$ is distribution-free. This
fact allows $D_N$ to be broadly used as a statistic for goodness-of-fit.

For instance, assume that we have the random sample $X_1, \ldots, X_N$ and
wish to test $H_0: F_X(x) = F_0(x)$ for all $x$, where $F_0(x)$ is a completely
specified c.d.f. (in our case $F_0(x)$ is the uniform distribution in the interval
(0, 1)). If $H_0$ is true, which means that we have a good random number
generator, then

$$\sqrt{N}\,D_N = \sqrt{N} \sup_{-\infty < x < \infty} |F_N(x) - F_0(x)| \tag{2.3.11}$$

is approximately distributed according to the c.d.f. $H(x)$.

If $H_0$ is false, which means that we have a bad random number
generator, then $F_N(x)$ will tend to be near the true c.d.f. $F_X(x)$ rather
than near $F_0(x)$, and consequently $\sup_{-\infty < x < \infty} |F_N(x) - F_0(x)|$ will
tend to be large. Hence a reasonable test criterion is to reject $H_0$ if
$\sup_{-\infty < x < \infty} |F_N(x) - F_0(x)|$ is large.

The Kolmogorov-Smirnov goodness-of-fit test with significance level $\alpha$
rejects $H_0$ if and only if $\sqrt{N}\,D_N > x_{1-\alpha}$, where the quantile $x_{1-\alpha}$ is given in
the tables of $H(x)$.
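
The test just described can be sketched in a few lines of Python for the uniform case $F_0(x) = x$. This is only an illustration under stated assumptions: the function names are hypothetical, the series for $H(x)$ in (2.3.10) is truncated after a fixed number of terms, and the asymptotic distribution is used in place of tables, which the text above suggests is adequate only when $N$ exceeds 35.

```python
import math


def ks_statistic_uniform(sample):
    """Return D_N = sup_x |F_N(x) - x| for a sample tested against U(0, 1)."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        # F_N jumps from (i-1)/n to i/n at x, so both sides of the jump are checked.
        d = max(d, i / n - x, x - (i - 1) / n)
    return d


def kolmogorov_cdf(x, terms=100):
    """Truncated series for H(x) in (2.3.10); the number of terms is an assumption."""
    if x <= 0.0:
        return 0.0
    s = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * x * x)
            for j in range(1, terms + 1))
    # Clamp to [0, 1] to absorb rounding in the alternating sum.
    return min(1.0, max(0.0, 1.0 - 2.0 * s))


def ks_test_uniform(sample, alpha=0.05):
    """Return (D_N, asymptotic p-value, reject flag) for H0: sample is U(0, 1)."""
    n = len(sample)
    d = ks_statistic_uniform(sample)
    p_value = 1.0 - kolmogorov_cdf(math.sqrt(n) * d)
    # Rejecting when p < alpha is equivalent to sqrt(N) * D_N > x_{1 - alpha}.
    return d, p_value, p_value < alpha
```

Rejecting on the p-value rather than on a tabulated quantile avoids hard-coding a table; both formulations express the same rule.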

Before we leave the chi-square and Kolmogorov-Smirnov tests, a word is
in order on the similarity and difference between them. The similarity lies
in the fact that both of them indicate how well a given set of observations
(pseudorandom numbers) fits some specified distribution (in our case the
uniform distribution); the difference is that the Kolmogorov-Smirnov test
applies to continuous (jumpless) c.d.f.'s and the chi-square to distributions
consisting exclusively of jumps (since all the observations are divided into
$k$ categories). Still the chi-square test may be applied to a continuous
$F_X(x)$, provided its domain is divided into $k$ parts and the variations within
each part are disregarded. This is essentially what we did earlier when
testing whether or not the sequence obtained from the random number
generator comes from the uniform distribution. When applying the chi-square
test, allowance must be made for its sensitivity to the number of classes and
their widths, which are chosen arbitrarily by the statistician.

Another difference is that chi-square requires grouped data whereas
Kolmogorov-Smirnov does not. Therefore, when the hypothesized distribution
is continuous, Kolmogorov-Smirnov allows us to examine the goodness-of-fit
for each of the $n$ observations, instead of only for $k$ classes,
where $k \le n$. In this sense Kolmogorov-Smirnov makes more complete use
of the available data.

As regards the efficiency of the Kolmogorov-Smirnov and chi-square
tests, at present too few theoretical results are available to allow meaningful
judgment.

2.3.3 Cramér-von Mises Goodness-of-Fit Test [4]

This test, like the preceding two, belongs to the goodness-of-fit tests and
its object is the same as theirs: for a given sample $X_1, \ldots, X_N$ from some
unknown c.d.f. we wish to test the null hypothesis

$$H_0: F_X(x) = F_0(x),$$

where $F_0(x)$ is a completely specified distribution, against the alternative

$$H_1: F_X(x) \ne F_0(x)$$

for at least one value of $x$. Denote by $X_{(1)}, \ldots, X_{(N)}$ the order statistics and
consider the following test statistic:

$$Y = \frac{1}{12N} + \sum_{i=1}^{N} \left[ F_0(X_{(i)}) - \frac{2i - 1}{2N} \right]^2. \tag{2.3.12}$$

In other words, the ordinate of $F_0(x)$ is found at each value $X_{(i)}$ in the random
sample, and from this is subtracted the quantity $(2i - 1)/2N$, which is
the average of the sample c.d.f. just before and just after the jump at $X_{(i)}$, that is,
the average of $(i - 1)/N$ and $i/N$. The difference is squared, so that positive
differences do not cancel the negative ones, and the results are added
together.

The quantiles of $Y$ are tabulated by using an asymptotic distribution
function of $Y$ as given by Anderson and Darling [2]. The Cramér-von
Mises goodness-of-fit test, with significance level $\alpha$, rejects $H_0$ if and only if
$Y > y_{1-\alpha}$, where the quantile $y_{1-\alpha}$ can be found from the appropriate
tables.
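
As a small illustration, the statistic (2.3.12) can be computed directly from the sorted sample. The sketch below assumes the hypothesized c.d.f. is passed in as a Python callable; the names `cramer_von_mises` and `uniform_cdf` are illustrative only, and the critical value $y_{1-\alpha}$ must still be taken from tables.

```python
def cramer_von_mises(sample, cdf0):
    """Return Y = 1/(12N) + sum_i [F0(X_(i)) - (2i - 1)/(2N)]^2, as in (2.3.12)."""
    xs = sorted(sample)  # order statistics X_(1) <= ... <= X_(N)
    n = len(xs)
    total = 1.0 / (12.0 * n)
    for i, x in enumerate(xs, start=1):
        total += (cdf0(x) - (2 * i - 1) / (2.0 * n)) ** 2
    return total


def uniform_cdf(x):
    """C.d.f. of the uniform distribution on (0, 1)."""
    return min(1.0, max(0.0, x))
```

A sample lying exactly on the midpoints $(2i - 1)/2N$ attains the smallest possible value, $Y = 1/(12N)$.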


2.3.4 Serial Test [31]

The serial test is used to check the degree of randomness between
successive numbers in a sequence and represents an extension of the
chi-square goodness-of-fit test.

Let $X_1 = (U_1, \ldots, U_k)$, $X_2 = (U_{k+1}, \ldots, U_{2k})$, $\ldots$, $X_N = (U_{(N-1)k+1}, \ldots, U_{Nk})$
be a sequence of $N$ $k$-tuples. We wish to test the
hypothesis that the r.v.'s $X_1, X_2, \ldots, X_N$ are independent and uniformly
distributed over the $k$-dimensional unit hypercube.

Dividing this hypercube into $r^k$ elementary hypercubes, each with volume
$1/r^k$, and denoting by $V_{j_1, \ldots, j_k}$ the number of $k$-tuples falling within the
element

$$\left[ \frac{j_i - 1}{r}, \frac{j_i}{r} \right), \quad i = 1, \ldots, k; \quad j_i = 1, \ldots, r,$$

we have that the statistic

$$\chi^2 = \frac{r^k}{N} \sum_{j_1, \ldots, j_k = 1}^{r} \left( V_{j_1, \ldots, j_k} - \frac{N}{r^k} \right)^2 \tag{2.3.13}$$

has an asymptotic chi-square distribution with $r^k - 1$ degrees of freedom.
Since there are $r^k$ hypercubes within which $X_i$ may fall, the question of
available space arises. If $k = 3$ and $r = 1000$, the serial test requires
$1000^3 = 10^9$ counters, a problematic requirement in terms of both storage
and search. In these circumstances the test is rarely used for $k > 2$.
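
The counting step can be sketched as follows. To sidestep the storage problem just mentioned, this illustrative version keeps counters only for the cells that are actually hit (a design choice of the sketch, not something prescribed by the text) and adds the contribution of the empty cells in closed form. The function name and the flat input list `u` are assumptions.

```python
from collections import Counter


def serial_test_statistic(u, k, r):
    """Return (chi2, degrees_of_freedom) for the serial test on k-tuples.

    u is a flat list of numbers in [0, 1); consecutive non-overlapping
    k-tuples are formed and each coordinate is split into r equal parts.
    """
    n_tuples = len(u) // k
    counts = Counter()
    for t in range(n_tuples):
        # Cell index of the t-th tuple; min() guards against a value equal to 1.0.
        cell = tuple(min(int(u[t * k + j] * r), r - 1) for j in range(k))
        counts[cell] += 1
    expected = n_tuples / r ** k
    chi2 = sum((c - expected) ** 2 for c in counts.values()) / expected
    # Each empty cell contributes (0 - expected)^2 / expected = expected.
    chi2 += (r ** k - len(counts)) * expected
    return chi2, r ** k - 1
```

The returned statistic is then compared with a chi-square quantile with $r^k - 1$ degrees of freedom.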

2.3.5 The Up-and-Down Test [43]

For this test the magnitude of each element is compared with that of its
immediate predecessor in the given sequence. If the next element is larger,
we have a run-up; if smaller, a run-down. We thus observe whether the
sequence increases or decreases and for how long. A decision concerning
the pseudorandom number generator may then be based on the number
and length of the runs.

For example, the seven-term sequence 0.2 0.4 0.1 0.3 0.6 0.7
0.5 consists of a run-up of length 1, followed by a run-down of length 1,
followed by a run-up of length 3, and finally a run-down of length 1, and
may be characterized by the binary symbol 1 0 1 1 1 0, where 1 denotes a
run-up and 0 a run-down. More generally, suppose there are $N$ terms, say
$X_1 < X_2 < \cdots < X_N$ when arranged in order of magnitude; the time-ordered
sequence of observations represents a permutation of these $N$
numbers. There are $N!$ permutations, each of them representing a possible
set of sample observations. Under the null hypothesis each of these
alternatives is equally likely to occur. The test of randomness, using
runs-up and runs-down for the sequence $X_1, \ldots, X_N$ of dimension $N$, is
based on the derived sequence of dimension $N - 1$, whose $i$th element is 0
or 1 depending on whether $X_{i+1} - X_i$, $i = 1, \ldots, N - 1$, is negative or
positive. A large number of long runs should not occur in a "truly"
random sample. The test rejects the null hypothesis if there are at least $r$
runs of length $t$ or more, where both $r$ and $t$ are determined by the desired
significance level.

The means, variances, and covariances of the numbers of runs of length
$t$ or more are given in Levene and Wolfowitz [34].

The expected numbers of occurrences of runs in a "truly" random
sample are [43]

$$\begin{array}{ll}
\dfrac{2N - 1}{3} & \text{for total runs} \\[2ex]
\dfrac{5N + 1}{12} & \text{for runs of length 1} \\[2ex]
\dfrac{11N - 14}{60} & \text{for runs of length 2} \\[2ex]
\dfrac{2\,[(k^2 + 3k + 1)N - (k^3 + 3k^2 - k - 4)]}{(k + 3)!} & \text{for runs of length } k, \text{ for } k < N - 1 \\[2ex]
\dfrac{2}{N!} & \text{for runs of length } N - 1.
\end{array}$$


Tables of the exact probabilities of at least $r$ runs of length $t$ or more
are available in Olmstead [44] for $N \le 14$, from which the appropriate
critical region can be found.

A test of randomness can also be based on the total number of runs,
whether up or down, irrespective of their lengths. The hypothesis of
randomness is rejected when the total number of runs is small. Levene [33]
has shown that the r.v.

$$Z = \frac{U - (2N - 1)/3}{[(16N - 29)/90]^{1/2}}, \tag{2.3.14}$$

where $U$ here denotes the total number of runs, is asymptotically standard
normal, so that for large $N$ the test of significance can be readily done.
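
A short sketch of the derived binary sequence, the run lengths, and the statistic (2.3.14) follows. The function names are illustrative, and ties between successive elements are treated as run-downs, which is an assumption that is harmless for continuous data.

```python
import math


def runs_up_down(seq):
    """Return (signs, runs): the derived 0/1 sequence and a list of (sign, length)."""
    signs = [1 if b > a else 0 for a, b in zip(seq, seq[1:])]
    runs = []
    for s in signs:
        if runs and runs[-1][0] == s:
            runs[-1][1] += 1      # the current run continues
        else:
            runs.append([s, 1])   # a new run starts
    return signs, [(s, length) for s, length in runs]


def total_runs_z(seq):
    """Return the standardized total number of runs, as in (2.3.14)."""
    n = len(seq)
    _, runs = runs_up_down(seq)
    total = len(runs)
    return (total - (2 * n - 1) / 3) / math.sqrt((16 * n - 29) / 90)
```

For the seven-term sequence in the text, `runs_up_down` reproduces the symbol 1 0 1 1 1 0 and the four runs of lengths 1, 1, 3, 1. Since randomness is rejected when the total number of runs is small, a one-sided test would reject for large negative values of the statistic.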


2.3.6 Gap Test [31] 

The gap test is concerned with the randomness of the digits in a
sequence of numbers. Let $U_1, \ldots, U_N$ be such a sequence. We say that any
subsequence $U_j, U_{j+1}, \ldots, U_{j+r}$ of $r + 1$ numbers represents a gap of length
$r$ if $U_j$ and $U_{j+r}$ lie between $\alpha$ and $\beta$ ($0 \le \alpha < \beta \le 1$) but $U_{j+i}$,
$i = 1, \ldots, r - 1$, does not. For a "true" sequence of random numbers the probability of
obtaining a gap of length $r$ is given in Ref. 44 and is equal to

$$P(r) = (0.9)^r (0.1). \tag{2.3.15}$$

A chi-square goodness-of-fit test based on the comparison of the expected
and actual numbers of gaps of length $r$ may again be used.

2.3.7 Maximum Test [35]

Let $Y_j = \max\{U_{(j-1)k+1}, \ldots, U_{jk}\}$, $j = 1, \ldots, N$, be the maxima of a sequence of $N$
$k$-tuples. It is shown in Ref. 35 that, if the sequence $U_1, \ldots, U_{Nk}$ is from
$\mathcal{U}(0, 1)$, then $Y_1^k, \ldots, Y_N^k$ is also from $\mathcal{U}(0, 1)$. To check whether or not
$U_1, \ldots, U_{Nk}$ is a "true" sequence of random numbers, we can apply the
chi-square or the Kolmogorov-Smirnov test to the sequence $\{Y_j^k\}_{j=1}^{N}$.

The reader might ask: “How many tests do we need to check the 
random number generator?” and also “Which of them should we choose?” 
In fact, more computer time may be spent testing random numbers than 
generating them. 

Another question that arises is: “What should be done with the sequence 
of numbers if it passes most of the tests but fails one of them?” These 
questions, as well as many others, must be solved by the statistician. 


EXERCISES 


1 Consider a sequence

$$X_{i+1} = f(X_i), \quad i = 0, 1, \ldots,$$

where $X_0, X_1, X_2, \ldots$ are integers, $0 \le X_i < m$, and $0 \le f(X_i) < m$.

(a) Show that the sequence is ultimately periodic, in the sense that there exist
numbers $\lambda$ and $\mu$ for which the values $X_0, X_1, \ldots, X_\mu, \ldots, X_{\mu+\lambda-1}$ are distinct,
but $X_{n+\lambda} = X_n$ when $n \ge \mu$. Find the maximum and minimum possible values
of $\mu$ and $\lambda$.

(b) Show that there exists an $n > 0$ such that $X_n = X_{2n}$; the smallest such value of $n$
lies in the range $\mu \le n \le \mu + \lambda$, and the value of $X_n$ is unique in the sense that,
if $X_n = X_{2n}$ and $X_r = X_{2r}$, then $X_r = X_n$ (hence $r - n$ is a multiple of $\lambda$).

From Knuth [31].


2 Prove that the middle-square method using $2n$-digit numbers to the base $b$ has
the following disadvantage: if ever a number $X$, whose most significant $n$ digits are
zero, appears, then the succeeding numbers will get smaller and smaller until zero
occurs repeatedly. From Knuth [31].

3 A sequence generated as in exercise 1 must begin to repeat after at most $m$ values
have been generated. Suppose we generalize the method so that $X_{i+1}$ depends on
$X_{i-1}$ as well as on $X_i$; formally, let $f(x, y)$ be a function such that, if $0 \le x, y < m$,
then $0 \le f(x, y) < m$. The sequence is constructed by selecting $X_0$ and $X_1$
arbitrarily, and then letting

$$X_{i+1} = f(X_i, X_{i-1}) \quad \text{for } i > 0.$$

Show that the maximum period conceivably attainable in this case is $m^2$. From
Knuth [31].

4 Given the two conditions that $c$ is odd and $a \bmod 4 = 1$, prove that they are
necessary and sufficient to guarantee the maximum length period in the sequence

$$X_{i+1} = aX_i + c \pmod{m}$$

when $m = 2^e$, $e \ge 2$. From Knuth [31].

5 Prove that the sequence

$$X_{i+1} = aX_i + c \pmod{m},$$

with $m = 10^e$, $e > 3$, and $c$ not a multiple of 2 and not a multiple of 5, will have a
full period if and only if $a \bmod 20 = 1$. From Knuth [31].

6 Show that the random function

$$S_n(x) = \frac{1}{n}\sum_{i=1}^{n} I(x - X_i),$$

where

$$I(t) = \begin{cases} 1, & \text{if } t \ge 0 \\ 0, & \text{if } t < 0, \end{cases}$$

is the empirical distribution function of a sample $X_1, X_2, \ldots, X_n$; this should be
done by showing that $S_n(x) = F_n(x)$ for all $x$.


7 Let $F_n(x)$ be the empirical distribution function for a random sample of size $n$
from $\mathcal{U}(0, 1)$. Define


for $0 \le t \le 1$.


Prove that $\operatorname{var}[X_n(t)] \le \operatorname{var}[Z_n(t)]$ for all $0 \le t \le 1$ and all $n$.

8 Find the minimum sample size $N$ required such that

$$P(D_N < 0.05) \ge 0.95.$$


9 A random sample of size 10 is obtained:

$X_1 = 0.503$, $X_2 = 0.621$, $X_3 = 0.447$, $X_4 = 0.203$, $X_5 = 0.710$,

$X_6 = 0.480$, $X_7 = 0.320$, $X_8 = 0.581$, $X_9 = 0.551$, $X_{10} = 0.386$.

For a level of significance $\alpha = 0.05$, test the null hypothesis

$$F_X(x) = F_0(x) \quad \text{for all } x,$$

where $F_0(x)$ is the uniform distribution, that is,

$$F_0(x) = \begin{cases} 0, & \text{if } x < 0 \\ x, & \text{if } 0 \le x \le 1 \\ 1, & \text{if } x > 1, \end{cases}$$

using:

(a) The Kolmogorov-Smirnov test.

(b) The Cramér-von Mises test.

CHAPTER 3

Random Variate
Generation


3.1 INTRODUCTION 

In this chapter we consider some procedures for generating random
variates (r.v.'s) from different distributions. These procedures are based on
the following three methods: the inverse transform method, the composition
method, and the acceptance-rejection method, which are described,
respectively, in Sections 3.2, 3.3, and 3.4. Some generalizations of von Neumann's
acceptance-rejection method are given in Section 3.4.3. Several techniques
for generating random vectors are the subject of Section 3.5. Sections 3.6
and 3.7 describe generation of random variates from the most widely used
continuous and discrete distributions, respectively.

The notations and mode of algorithm presentation are similar to those in
Fishman [12] and are used here to provide uniformity with other works in
the field of random variate generation.

For convenience we refer to sampling from a particular distribution by
placing the name of the distribution or type of random variate before the
word generation. For example, exponential generation denotes sampling
from an exponential distribution.

For simplicity $U$ is a uniform deviate with probability density function
(p.d.f.)

$$f_U(u) = \begin{cases} 1, & 0 \le u \le 1 \\ 0, & \text{otherwise,} \end{cases}$$

$V$ is a standard exponential deviate with p.d.f.

$$f_V(v) = \begin{cases} e^{-v}, & 0 \le v < \infty \\ 0, & \text{otherwise,} \end{cases}$$

and $Z$ is a standard normal deviate with p.d.f.

$$f_Z(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}, \quad -\infty < z < \infty.$$

$X$ usually denotes the random variable with p.d.f. $f_X(x)$ from which we
wish to generate a value.


3.2 INVERSE TRANSFORM METHOD 

Let $X$ be a random variable with cumulative probability distribution
function (c.d.f.) $F_X(x)$. Since $F_X(x)$ is a nondecreasing function, the
inverse function $F_X^{-1}(y)$ may be defined for any value of $y$ between 0 and
1 as: $F_X^{-1}(y)$ is the smallest $x$ satisfying $F_X(x) \ge y$, that is,

$$F_X^{-1}(y) = \inf\{x : F_X(x) \ge y\}, \quad 0 \le y \le 1. \tag{3.2.1}$$

Let us prove that if $U$ is uniformly distributed over the interval (0, 1), then
(Fig. 3.2.1)

$$X = F_X^{-1}(U) \tag{3.2.2}$$

has cumulative distribution function $F_X(x)$.

The proof is straightforward:

$$P(X \le x) = P[F_X^{-1}(U) \le x] = P[U \le F_X(x)] = F_X(x). \tag{3.2.3}$$

So to get a value, say $x$, of a random variable $X$, obtain a value, say $u$, of a
random variable $U$, compute $F_X^{-1}(u)$, and set it equal to $x$.

Fig. 3.2.1 Inverse probability integral transformation method.

Algorithm IT-1

1 Generate $U$ from $\mathcal{U}(0, 1)$.

2 $X \leftarrow F_X^{-1}(U)$.

3 Deliver $X$.


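Example 1 Generate an r.v. with p.d.f.

$$f_X(x) = \begin{cases} 2x, & 0 \le x \le 1 \\ 0, & \text{otherwise.} \end{cases} \tag{3.2.4}$$

The c.d.f. is

$$F_X(x) = \begin{cases} 0, & x < 0 \\ x^2, & 0 \le x \le 1 \\ 1, & x > 1. \end{cases}$$

Applying (3.2.2), we have

$$X = F_X^{-1}(U) = U^{1/2}, \quad 0 \le u \le 1.$$

Therefore to generate a variate $X$ from the p.d.f. (3.2.4) we generate $U$ from
$\mathcal{U}(0, 1)$ and then take the square root of $U$.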


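Example 2 Generate an r.v. from the uniform distribution $\mathcal{U}(a, b)$, that
is,

$$f_X(x) = \begin{cases} \dfrac{1}{b - a}, & a \le x \le b \\ 0, & \text{otherwise.} \end{cases}$$

The c.d.f. is

$$F_X(x) = \begin{cases} 0, & x < a \\ \dfrac{x - a}{b - a}, & a \le x \le b \\ 1, & x > b, \end{cases}$$

and $X = F_X^{-1}(U) = a + (b - a)U$.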
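
Algorithm IT-1 and the two examples above can be sketched as follows. The helper names are illustrative, and Python's `random` module stands in for whatever generator of $\mathcal{U}(0, 1)$ numbers is actually used.

```python
import math
import random


def inverse_transform(inv_cdf, rng=random):
    """Algorithm IT-1: generate U from U(0, 1) and deliver X = F^{-1}(U)."""
    u = rng.random()
    return inv_cdf(u)


def inv_cdf_example1(u):
    """Inverse c.d.f. for f(x) = 2x on [0, 1] (Example 1): F^{-1}(u) = sqrt(u)."""
    return math.sqrt(u)


def make_inv_cdf_uniform(a, b):
    """Inverse c.d.f. for U(a, b) (Example 2): F^{-1}(u) = a + (b - a) u."""
    return lambda u: a + (b - a) * u
```

Passing the inverse c.d.f. as a function keeps the three steps of the algorithm separate from any particular distribution.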


Example 3 Let $X_1, \ldots, X_n$ be independent and identically distributed
(i.i.d.) r.v.'s distributed $F_X(x)$. Define $Y_n = \max(X_1, \ldots, X_n)$ and
$Y_1 = \min(X_1, \ldots, X_n)$. Generate $Y_n$ and $Y_1$. The distributions of $Y_n$ and $Y_1$ are,
respectively [23],

$$F_{Y_n}(y) = [F_X(y)]^n$$

and

$$F_{Y_1}(y) = 1 - [1 - F_X(y)]^n.$$

Applying (3.2.2), we get

$$Y_n = F_X^{-1}(U^{1/n})$$

and

$$Y_1 = F_X^{-1}(1 - U^{1/n}).$$

In the particular case where $X = U$ we have

$$Y_n = U^{1/n}$$

and

$$Y_1 = 1 - U^{1/n}.$$

To apply this method $F_X(x)$ must exist in a form for which the
corresponding inverse transform can be found analytically. Distributions
in this group are exponential, uniform, Weibull, logistic, and Cauchy.
Unfortunately, for many probability distributions it is either impossible or
extremely difficult to find the inverse transform, that is, to solve

$$U = \int_{-\infty}^{x} f_X(t)\,dt$$

with respect to $x$.

Even in the case when $F_X^{-1}$ exists in an explicit form, the inverse
transform method is not necessarily the most efficient method for generating
random variates.

Example 4 Generate a random variable from the piecewise constant
p.d.f. (Fig. 3.2.2)

$$f_X(x) = \begin{cases} C_i, & x_{i-1} \le x \le x_i, \quad i = 1, 2, \ldots, n \\ 0, & \text{otherwise,} \end{cases}$$

where $C_i \ge 0$, $a = x_0 < x_1 < \cdots < x_{n-1} < x_n = b$. Denote
$P_i = \int_{x_{i-1}}^{x_i} C_i\,dx = C_i(x_i - x_{i-1})$ and $F_i = \sum_{j=1}^{i} P_j$, $F_0 = 0$; then

$$F_X(x) = \sum_{j=1}^{i-1} P_j + \int_{x_{i-1}}^{x} C_i\,dx = F_{i-1} + C_i(x - x_{i-1}),$$

where $i = \max\{j : x_{j-1} \le x\}$.

Fig. 3.2.2 Piecewise constant p.d.f.

Now solving $F_X(X) = U$ with respect to $X$, we obtain

$$X = x_{i-1} + \frac{U - F_{i-1}}{C_i}, \quad \text{where } F_{i-1} \le U < F_i.$$

To carry out the method:

1 Generate $U$ from $\mathcal{U}(0, 1)$.

2 Find $i$ from $\sum_{j=1}^{i-1} P_j \le U < \sum_{j=1}^{i} P_j$.

3 $X \leftarrow x_{i-1} + \left(U - \sum_{j=1}^{i-1} P_j\right)/C_i$.

4 Deliver $X$.
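
The four steps of Example 4 can be sketched as below. The function name and the use of a binary search for step 2 are choices of this illustration, not part of the text; the sketch also assumes the heights $C_i$ are already normalized so that the areas $P_i$ sum to 1.

```python
import bisect
from itertools import accumulate


def make_piecewise_sampler(xs, cs):
    """Build the map u -> X for a piecewise constant p.d.f.

    xs: breakpoints x_0 < x_1 < ... < x_n; cs: heights C_1, ..., C_n.
    """
    ps = [c * (hi - lo) for c, lo, hi in zip(cs, xs, xs[1:])]  # areas P_i
    cum = [0.0] + list(accumulate(ps))                         # F_0, F_1, ..., F_n
    if abs(cum[-1] - 1.0) > 1e-9:
        raise ValueError("the heights must define a density with total area 1")
    cum[-1] = 1.0  # remove rounding error so that every u in [0, 1) finds a cell

    def sample_from_u(u):
        # Step 2: find i with F_{i-1} <= u < F_i (binary search).
        i = bisect.bisect_right(cum, u)
        # Step 3: X = x_{i-1} + (u - F_{i-1}) / C_i.
        return xs[i - 1] + (u - cum[i - 1]) / cs[i - 1]

    return sample_from_u
```

A variate is then obtained as `make_piecewise_sampler(xs, cs)(u)` with `u` drawn from $\mathcal{U}(0, 1)$. The cumulative sums are computed once, so each draw costs one search and one division.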

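Example 5 Let $f_X(x)$ be represented as

$$f_X(x) = \sum_{i=1}^{n} f_i(x), \quad f_i(x) \ge 0. \tag{3.2.5}$$

Denote

$$P_i = \int_{-\infty}^{\infty} f_i(x)\,dx, \quad F_i = \sum_{j=1}^{i} P_j, \quad F_0 = 0,$$

and

$$\Phi_i(x) = \int_{-\infty}^{x} f_i(t)\,dt.$$

Let us prove that

$$X = \Phi_i^{-1}(U - F_{i-1}), \quad \text{where } F_{i-1} \le U < F_i, \tag{3.2.6}$$

has the p.d.f. $f_X(x)$.

It is easy to see that $\Phi_i/P_i$ is a c.d.f. and that $(U - F_{i-1})/P_i$ is distributed
$\mathcal{U}(0, 1)$ if $F_{i-1} \le U < F_i$. Therefore the r.v. $X = (\Phi_i/P_i)^{-1}[(U - F_{i-1})/P_i]$ has the
p.d.f. $f_i/P_i$ conditional on $F_{i-1} \le U < F_i$. Noticing that $X = \Phi_i^{-1}(U - F_{i-1})$
and that the event $F_{i-1} \le U < F_i$ has probability $P_i$, so that
$f_X(x) = \sum_{i=1}^{n} P_i\,[f_i(x)/P_i]$, the result follows immediately.
To carry out the method:

1 $P_i \leftarrow \int_{-\infty}^{\infty} f_i(x)\,dx$, $i = 1, \ldots, n$.

2 $F_i \leftarrow \sum_{j=1}^{i} P_j$, $i = 1, \ldots, n$.

3 Generate $U$ from $\mathcal{U}(0, 1)$.

4 Find $i$ from $F_{i-1} \le U < F_i$, $F_0 = 0$.

5 $\Phi_i(x) \leftarrow \int_{-\infty}^{x} f_i(t)\,dt$, $i = 1, \ldots, n$.

6 $X \leftarrow \Phi_i^{-1}(U - F_{i-1})$.

7 Deliver $X$.

As an example, let [22]

$$f_X(x) = \tfrac{3}{8}(1 + x^2), \quad -1 \le x \le 1.$$

Assume $f_1(x) = \tfrac{3}{8}$, $f_2(x) = \tfrac{3}{8}x^2$, $-1 \le x \le 1$. Then $P_1 = \tfrac{3}{4}$, $P_2 = \tfrac{1}{4}$,
$\Phi_1(x) = \tfrac{3}{8}(x + 1)$, $\Phi_2(x) = \tfrac{1}{8}(x^3 + 1)$, and

$$X = \begin{cases} \tfrac{8}{3}u - 1, & \text{if } u < \tfrac{3}{4} \\ (8u - 7)^{1/3}, & \text{if } u \ge \tfrac{3}{4}. \end{cases}$$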


3.3 COMPOSITION METHOD

This method is employed by Butler [7]. Refs. 11, 22, 29, and 35 exploit this
method to great advantage.

In this technique $f_X(x)$, the p.d.f. of the distribution to be simulated, is
expressed as a probability mixture of properly selected density functions.

Mathematically, let $g(x|y)$ be a family of one-parameter density functions,
where $y$ is the parameter identifying a unique $g(x)$. If a value of $y$ is
drawn from a continuous cumulative function $F_Y(y)$ and then if $X$ is
sampled from the $g(x)$ for that chosen $y$, the density function for $X$ will be

$$f_X(x) = \int g(x|y)\,dF_Y(y). \tag{3.3.1}$$

If $y$ is an integer parameter, then

$$f_X(x) = \sum_{i} P_i\,g(x|y = i), \tag{3.3.2}$$

where

$$\sum_{i} P_i = 1, \quad P_i > 0; \quad i = 1, 2, \ldots; \quad P_i = P(y = i).$$

By using this technique some important distributions can be generated.
This technique may be applied for generating complex distributions from
simpler distributions that are themselves easily generated by the inverse
transform technique or by the acceptance-rejection technique.

Another advantage of this technique is that we can sometimes find a
decomposition (3.3.2) that assigns high probabilities $P_i$ to p.d.f.'s from
which sampling $X$ is inexpensive and concomitantly assigns low probabilities
$P_i$ to p.d.f.'s from which sampling $X$ is expensive.

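Example 1 Generate an r.v. from

$$f_X(x) = \tfrac{5}{12}\left[1 + (x - 1)^4\right], \quad 0 \le x \le 2,$$

which can be written

$$f_X(x) = \tfrac{5}{6} f_1(x) + \tfrac{1}{6} f_2(x), \quad 0 \le x \le 2,$$

where

$$f_1(x) = \tfrac{1}{2}, \quad f_2(x) = \tfrac{5}{2}(x - 1)^4, \quad 0 \le x \le 2.$$

Therefore

$$X = \begin{cases} 2U_2, & \text{if } U_1 < \tfrac{5}{6} \\ 1 + (2U_2 - 1)^{1/5}, & \text{if } U_1 \ge \tfrac{5}{6}. \end{cases}$$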
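
The two-branch rule of this example can be sketched as a function of the two uniform numbers. The name `composition_example1` is illustrative; the fifth root is taken with its sign, which is an implementation detail this sketch adds because a fractional power of a negative float is not a real number in Python.

```python
import math
import random


def composition_example1(u1, u2):
    """Map (U1, U2) to a variate with p.d.f. (5/12)[1 + (x - 1)^4] on [0, 2]."""
    if u1 < 5.0 / 6.0:
        # With probability 5/6 sample from f_1, the uniform density on [0, 2].
        return 2.0 * u2
    # With probability 1/6 invert F_2(x) = [(x - 1)^5 + 1] / 2.
    t = 2.0 * u2 - 1.0
    return 1.0 + math.copysign(abs(t) ** 0.2, t)


def sample_composition_example1(rng=random):
    """Draw one variate using two independent U(0, 1) numbers."""
    return composition_example1(rng.random(), rng.random())
```

The inexpensive branch is taken five times out of six, which is the kind of decomposition the text recommends.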


Example 2 (Butler [7]) Generate an r.v. from

$$f_X(x) = n \int_{1}^{\infty} y^{-n} e^{-xy}\,dy.$$

Let

$$dF_Y(y) = \frac{n\,dy}{y^{n+1}}, \quad 1 < y < \infty; \quad n \ge 1,$$

and $g(x|y) = y e^{-yx}$. A variate is now drawn from the distribution $F_Y(y)$.
Once this $y$ is selected, it determines a particular $g(x) = y e^{-yx}$. The desired
variate from $f_X(x)$ is then simply a variate generated from $g(x) = y e^{-yx}$.
To carry out the composition method:

1 Generate $U_1$, $U_2$ from $\mathcal{U}(0, 1)$.

2 $Y \leftarrow U_1^{-1/n}$.

3 $X \leftarrow -(1/Y) \ln U_2$.

4 Deliver $X$.

Example 3 Generate an r.v. from

$$F_X(x) = \sum_{i=1}^{\infty} P_i x^i, \quad 0 \le x \le 1,$$

where $\sum_{i=1}^{\infty} P_i = 1$, $P_i \ge 0$. The algorithm can be written directly:

1 Generate $U_1$ and $U_2$ from $\mathcal{U}(0, 1)$.

2 Find $i$ from $\sum_{k=1}^{i-1} P_k \le U_1 < \sum_{k=1}^{i} P_k$, where $\sum_{k=1}^{0} P_k = 0$.

3 $X \leftarrow U_2^{1/i}$.

4 Deliver $X$.


3.4 ACCEPTANCE-REJECTION METHOD 

This method is due to von Neumann [34] and consists of sampling a 
random variate from an appropriate distribution and subjecting it to a test 
to determine whether or not it will be acceptable for use. 

3.4.1 Single-Variate Case

Let $X$ be generated from $f_X(x)$, $x \in I$. To carry out the method we
represent $f_X(x)$ as

$$f_X(x) = C h(x) g(x), \tag{3.4.1}$$

where $C \ge 1$, $h(x)$ is also a p.d.f., and $0 < g(x) \le 1$. Then we generate two
random variates $U$ and $Y$ from $\mathcal{U}(0, 1)$ and $h(y)$, respectively, and test to
see whether or not the inequality $U \le g(Y)$ holds:

1 If the inequality holds, then accept $Y$ as a variate generated from
$f_X(x)$.

2 If the inequality is violated, reject the pair $U$, $Y$ and try again.

The theory behind this method is based on the following.

Theorem 3.4.1 Let $X$ be a random variate distributed with the p.d.f.
$f_X(x)$, $x \in I$, which is represented as

$$f_X(x) = C g(x) h(x),$$

where $C \ge 1$, $0 < g(x) \le 1$, and $h(x)$ is also a p.d.f. Let $U$ and $Y$ be
distributed $\mathcal{U}(0, 1)$ and $h(y)$, respectively. Then

$$f_Y(x \mid U \le g(Y)) = f_X(x). \tag{3.4.2}$$

Proof By Bayes' formula

$$f_Y(x \mid U \le g(Y)) = \frac{P(U \le g(Y) \mid Y = x)\,h(x)}{P(U \le g(Y))}. \tag{3.4.3}$$

We can directly compute

$$P(U \le g(Y) \mid Y = x) = P(U \le g(x)) = g(x), \tag{3.4.4}$$

$$P(U \le g(Y)) = \int P(U \le g(Y) \mid Y = x)\,h(x)\,dx = \int g(x) h(x)\,dx = \int \frac{f_X(x)}{C}\,dx = \frac{1}{C}. \tag{3.4.5}$$

Upon substituting (3.4.4) and (3.4.5) into (3.4.3), we obtain

$$f_Y(x \mid U \le g(Y)) = C g(x) h(x) = f_X(x).$$

Q.E.D.


The efficiency of the acceptance-rejection method is determined by the
inequality $U \le g(Y)$ (see (3.4.5)). Since the trials are independent, the
probability of success in each trial is $p = 1/C$. The number of unsuccessful
trials $N$ before a successful pair $U$, $Y$ is found has a geometric distribution:

$$P_N(n) = p(1 - p)^n, \quad n = 0, 1, \ldots, \tag{3.4.6}$$

so that the expected number of trials, including the successful one, is equal to $C$.

Algorithm AR-1 describes the necessary steps.

Algorithm AR-1

1 Generate $U$ from $\mathcal{U}(0, 1)$.

2 Generate $Y$ from the p.d.f. $h(y)$.

3 If $U \le g(Y)$, deliver $Y$ as the variate generated from $f_X(x)$.

4 Go to step 1.

For this method to be of practical interest the following criteria must be
used in selecting $h(x)$:

1 It should be easy to generate an r.v. from $h(x)$.

2 The efficiency of the procedure $1/C$ should be large, that is, $C$
should be close to 1 (which occurs when $h(x)$ is similar to $f_X(x)$ in shape).

To illustrate this method (Fig. 3.4.1) let us choose $C$ such that
$f_X(x) \le C h(x)$ for all $x \in I$, where $C \ge 1$.

The problem then is to find a function $\phi(x) = C h(x)$ such that $\phi(x) \ge f_X(x)$
and a function $h(x) = \phi(x)/C$, from which the r.v.'s can be easily generated.

The maximum efficiency is achieved when $f_X(x) = \phi(x)$, $\forall x \in I$. In this
case $1/C = C = 1$, $g(x) = 1$, and there is no need for the acceptance-rejection
method because $h(x) = f_X(x)$ (to generate a variate from $f_X(x)$ is
the same as from $h(x)$).

There exist an infinite number of ways to choose $h(x)$ to satisfy (3.4.1).
Many papers connected with choosing $h(x)$ have been written, and we
consider some of them later.
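
Algorithm AR-1 can be sketched generically, with the sampler for $h$ and the function $g$ supplied by the caller. The names below are illustrative. As a hypothetical check that is not taken from the text, the density $f_X(x) = 2x$ on $[0, 1]$ can be written as $C h(x) g(x)$ with $h$ uniform, $g(x) = x$, and $C = 2$, so about two trials per delivered variate are expected.

```python
import random


def acceptance_rejection(sample_h, g, rng):
    """Algorithm AR-1: return (Y, number of trials used)."""
    trials = 0
    while True:
        trials += 1
        u = rng.random()      # step 1: U from U(0, 1)
        y = sample_h(rng)     # step 2: Y from the p.d.f. h
        if u <= g(y):         # step 3: accept and deliver Y
            return y, trials
        # step 4: otherwise go back to step 1


def sample_2x(rng):
    """Hypothetical use: f(x) = 2x on [0, 1] with h uniform and g(x) = x."""
    y, _ = acceptance_rejection(lambda r: r.random(), lambda x: x, rng)
    return y
```

Returning the trial count alongside the variate makes it easy to compare the observed average with the theoretical value $C$.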

In the particular case when $\phi(x) = M$, $a \le x \le b$, and

$$h(x) = \frac{1}{b - a}, \quad a \le x \le b, \tag{3.4.7}$$

we obtain from (3.4.1)

$$C = M(b - a), \tag{3.4.8}$$

$$g(x) = \frac{f_X(x)}{M}, \quad a \le x \le b. \tag{3.4.9}$$

Von Neumann [34] first considered the acceptance-rejection method for
this particular case, and his algorithm can be described as follows.

Algorithm AR-2

1 Generate $U_1$ and $U_2$ from $\mathcal{U}(0, 1)$.

2 $Y \leftarrow a + U_2(b - a)$.

3 If

$$U_1 \le g(Y) = \frac{f_X(Y)}{M} = \frac{f_X[a + (b - a)U_2]}{M},$$

deliver $Y$ as the variate generated from $f_X(x)$.

4 Go to step 1.

Fig. 3.4.1 Illustration of von Neumann's procedure.


We now consider three examples. The first two are related to Algorithm
AR-2 and the third to Algorithm AR-1.


Example 1 Generate a random variate from

$$f_X(x) = 3x^2, \quad 0 \le x \le 1.$$

Here $M = 3$, $a = 0$, and $b = 1$. To apply Algorithm AR-2:

1 Generate two uniform random variates $U_1$ and $U_2$ from $\mathcal{U}(0, 1)$.

2 Test to see if $U_1 \le U_2^2$.

3 If the inequality holds, accept $U_2$ as the variate generated from $f_X(x)$.

4 If the inequality is violated, reject $U_1$ and $U_2$ and repeat steps 1
through 3.


Example 2 Generate a random variate from

$$f_X(x) = \frac{2}{\pi R^2} \sqrt{R^2 - x^2}, \quad -R \le x \le R.$$

Assume $M = 2/(\pi R)$; then Algorithm AR-2 is as follows:

1 Generate two uniform random variates $U_1$ and $U_2$ from $\mathcal{U}(0, 1)$.

2 Compute $Y = (2U_2 - 1)R$.

3 If $U_1 \le f_X(Y)/M$, which is equivalent to $(2U_2 - 1)^2 \le 1 - U_1^2$, then
accept $Y = (2U_2 - 1)R$ as the variate generated from $f_X(x)$.

4 If the inequality is violated, reject $U_1$ and $U_2$ and repeat steps 1
through 3 again.

The expected number of trials is $C = 4/\pi$ and the efficiency is $1/C = \pi/4 =
0.785$.
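
Algorithm AR-2 and the two examples above can be sketched as follows. The function names are illustrative, and the bound $M$ is supplied by the caller, who is assumed to have checked that $f_X(x) \le M$ on $[a, b]$.

```python
import math
import random


def von_neumann(f, a, b, m, rng):
    """Algorithm AR-2 for a p.d.f. f bounded by m on the interval [a, b]."""
    while True:
        u1, u2 = rng.random(), rng.random()   # step 1
        y = a + u2 * (b - a)                  # step 2
        if u1 <= f(y) / m:                    # step 3: accept and deliver Y
            return y
        # step 4: otherwise go back to step 1


def f_example1(x):
    """Example 1: f(x) = 3x^2 on [0, 1], bounded by M = 3."""
    return 3.0 * x * x


def make_f_example2(radius):
    """Example 2: f(x) = 2 sqrt(R^2 - x^2) / (pi R^2), bounded by M = 2 / (pi R)."""
    def f(x):
        return 2.0 * math.sqrt(max(radius * radius - x * x, 0.0)) / (math.pi * radius ** 2)
    return f
```

For Example 2 with radius `R`, the call would be `von_neumann(make_f_example2(R), -R, R, 2.0 / (math.pi * R), rng)`.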


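Example 3 Generate a random variate from

$$f_X(x) = \frac{x^{\alpha-1} e^{-x}}{\Gamma(\alpha)}, \quad 0 < \alpha < 1; \quad x \ge 0.$$

To apply the acceptance-rejection method we use the inequality

$$x^{\alpha-1} e^{-x} \le \begin{cases} x^{\alpha-1}, & 0 \le x \le 1 \\ e^{-x}, & x > 1, \end{cases}$$

which is the same as

$$f_X(x) = \frac{x^{\alpha-1} e^{-x}}{\Gamma(\alpha)} \le \phi(x) = C h(x) = \begin{cases} \dfrac{x^{\alpha-1}}{\Gamma(\alpha)}, & 0 \le x \le 1 \\[1.5ex] \dfrac{e^{-x}}{\Gamma(\alpha)}, & x > 1. \end{cases}$$

Here

$$h(x) = \begin{cases} \dfrac{x^{\alpha-1}}{(1/\alpha) + (1/e)}, & 0 \le x \le 1 \\[1.5ex] \dfrac{e^{-x}}{(1/\alpha) + (1/e)}, & 1 < x < \infty, \end{cases} \qquad C = \frac{1}{\Gamma(\alpha)} \left( \frac{1}{\alpha} + \frac{1}{e} \right),$$

and we obtain from (3.4.1)

$$g(x) = \begin{cases} e^{-x}, & 0 \le x \le 1 \\ x^{\alpha-1}, & 1 < x < \infty. \end{cases}$$

To generate a random variate from $f_X(x)$ we generate two random variates
$U$ and $Y$ from $\mathcal{U}(0, 1)$ and $h(y)$, respectively, and then apply the acceptance
rule $U \le g(Y)$.

Note that the random variate $Y$ can be easily generated by the inverse
transform method. To apply Algorithm AR-1:

1 Generate $U$ from $\mathcal{U}(0, 1)$.

2 Generate $Y$ from $h(y)$.

3 If

$$U \le g(Y) = \begin{cases} e^{-Y}, & 0 \le Y \le 1 \\ Y^{\alpha-1}, & 1 < Y < \infty, \end{cases}$$

deliver $Y$ as the variate generated from $f_X(x)$.

4 Go to step 1.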
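
A sketch of this example follows. The inverse transform for $h$ is worked out here as an assumption of the illustration, since the text only notes that it is easy: with $K = 1/\alpha + 1/e$ and $p_1 = e/(e + \alpha)$, the c.d.f. of $h$ is $x^{\alpha}/(\alpha K)$ on $[0, 1]$ and $1 - e^{-x}/K$ beyond 1. The function name is hypothetical.

```python
import math
import random


def gamma_small_shape(alpha, rng):
    """Acceptance-rejection sampler for the gamma p.d.f. with 0 < alpha < 1."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("this sketch requires 0 < alpha < 1")
    k = 1.0 / alpha + 1.0 / math.e        # normalizing constant of h
    p1 = math.e / (math.e + alpha)        # probability that Y falls in [0, 1] under h
    while True:
        u = rng.random()                  # step 1
        v = rng.random()                  # step 2: Y from h by inverse transform
        if v <= p1:
            y = (v / p1) ** (1.0 / alpha)
            accept = u <= math.exp(-y)            # g(y) = e^{-y} on [0, 1]
        else:
            y = -math.log(k * (1.0 - v))
            accept = u <= y ** (alpha - 1.0)      # g(y) = y^{alpha - 1} for y > 1
        if accept:                        # step 3
            return y
        # step 4: otherwise go back to step 1
```

For $\alpha = 0.5$ the sample mean of many draws should be close to the gamma mean $\alpha = 0.5$.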

The probability of success is

$$\frac{1}{C} = \frac{\alpha e\,\Gamma(\alpha)}{\alpha + e},$$

and the mean number of trials is

$$C = \frac{\alpha + e}{\alpha e\,\Gamma(\alpha)}.$$

Let us assume that $h(x)$ is known up to the parameter $\beta$, that is,
$h(x) = h(x, \beta)$. It is shown (see Michailov [22] and Tocher [33]) that the
optimal $\beta$, which provides the minimum of $C$, is achieved by

$$\min_{\beta} \max_{x} \frac{f_X(x)}{h(x, \beta)}. \tag{3.4.10}$$


3.4.2 Multivariate Case 

Theorem 3.4.1 can easily be extended to the multivariate case. The proof
of the following theorem will be left to the reader.

Theorem 3.4.2 Let $X = (X_1, \ldots, X_n)$ be a random vector distributed with
the p.d.f. $f_X(x)$, $x \in D$, where $D = \{(x_1, \ldots, x_n) : a_i \le x_i \le b_i,
i = 1, \ldots, n\}$, and suppose $f_X(x) \le M$. Generate $U_1, \ldots, U_{n+1}$ from $\mathcal{U}(0, 1)$
and define $Y = (Y_1, \ldots, Y_n)$, where $Y_i = a_i + (b_i - a_i)U_i$, $i = 1, \ldots, n$. Then

$$f_Y\left(x \,\Big|\, U_{n+1} \le \frac{f_X(Y)}{M}\right) = f_X(x).$$

We can see that this theorem is an extension of von Neumann's method
described in Algorithm AR-2 for the multivariate case.

Example 4 Generate a random vector uniformly distributed over the
complex region $G$ (Fig. 3.4.2). The algorithm is straightforward.

1 Generate a random vector $Y$ uniformly distributed in $\Omega$, where $\Omega$ is a
nice region (multidimensional rectangle, hypersphere, hyperellipsoid,
etc.) containing $G$.

2 If $Y \in G$, accept $Y$ as a variate uniformly distributed in $G$.

3 Go to step 1.

Fig. 3.4.2 Generating a random vector uniformly distributed over a complex area.

Example 5 Generate a random vector uniformly distributed on the surface
of an $n$-dimensional unit sphere.

To generate a random vector uniformly distributed on the surface of an
$n$-dimensional unit sphere, we simulate a random vector uniformly distributed
in the $n$-dimensional hypercube $\{-1 \le x_i \le 1\}_{i=1}^{n}$ and then accept or
reject the sample $(X_1, \ldots, X_n)$, depending on whether the point $(X_1, \ldots, X_n)$
is inside or outside the $n$-dimensional sphere.

The algorithm is as follows:

1 Generate $U_1, \ldots, U_n$ from $\mathcal{U}(0, 1)$.

2 $X_1 \leftarrow 1 - 2U_1, \ldots, X_n \leftarrow 1 - 2U_n$, and $Y^2 \leftarrow \sum_{i=1}^{n} X_i^2$.

3 If $Y^2 \le 1$, accept $Z = (Z_1, \ldots, Z_n)$, where $Z_i = X_i/Y$, $i = 1, \ldots, n$,
as the desired vector.

4 Go to step 1.

The efficiency of the method is equal to the ratio

$$\frac{1}{C} = \frac{\text{volume of the sphere}}{\text{volume of the hypercube}} = \frac{\pi^{n/2}}{n\,2^{n-1}\,\Gamma(n/2)}.$$

For even $n$ ($n = 2m$)

$$\frac{1}{C} = \frac{\pi^m}{m!\,2^{2m}} = \frac{1}{m!}\left(\frac{\pi^{1/2}}{2}\right)^{2m}$$

and

$$\lim_{m \to \infty} \frac{1}{C} = 0.$$

In other words, for $n$ big enough the acceptance-rejection method is
inefficient.

Remark To generate a random vector uniformly distributed inside an
$n$-dimensional unit sphere, we have to rewrite only step 3 in the last
algorithm as follows:

3 If $Y^2 \le 1$, accept $X = (X_1, \ldots, X_n)$ as the desired vector.
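
The algorithm of Example 5 and its efficiency formula can be sketched as follows. The function names are illustrative, and the guard against a zero norm is an addition of this sketch to avoid dividing by zero.

```python
import math
import random


def uniform_on_sphere(n, rng):
    """Return (Z, trials): a point on the unit sphere surface in n dimensions."""
    trials = 0
    while True:
        trials += 1
        x = [1.0 - 2.0 * rng.random() for _ in range(n)]   # steps 1 and 2
        y2 = sum(xi * xi for xi in x)
        if 0.0 < y2 <= 1.0:                                # step 3 (zero norm excluded)
            y = math.sqrt(y2)
            return [xi / y for xi in x], trials
        # step 4: otherwise go back to step 1


def acceptance_probability(n):
    """1/C = pi^(n/2) / (n 2^(n-1) Gamma(n/2)), the sphere-to-hypercube volume ratio."""
    return math.pi ** (n / 2) / (n * 2 ** (n - 1) * math.gamma(n / 2))
```

Evaluating `acceptance_probability` for increasing `n` shows how quickly the efficiency decays, which is the point made in the text.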

3.4.3 Generalization of von Neumann's Method

There are various modifications and generalizations of von Neumann's
method [10, 29]. For simplicity consider the single-variate case.

Consider a random vector $Y = (Y_1, Y_2)$ distributed $h_{Y_1 Y_2}(y_1, y_2)$,
$-\infty < y_1 < \infty$, $y_2 \in [0, M]$, and let $T(x)$ be an arbitrary continuous
function such that $\sup_x T(x) = M$. Similarly to (3.4.2) let us find

$$f_{Y_1}(x \mid Y_2 \le T(Y_1)),$$

which we denote $f_X(x)$. By Bayes' formula

$$P\{Y_1 \le x \mid Y_2 \le T(Y_1)\} = \frac{P\{Y_1 \le x,\ Y_2 \le T(Y_1)\}}{P\{Y_2 \le T(Y_1)\}} = \frac{\displaystyle\int_{-\infty}^{x} dy_1 \int_{-\infty}^{T(y_1)} h_{Y_1 Y_2}(y_1, y_2)\,dy_2}{\displaystyle\int_{-\infty}^{\infty} dy_1 \int_{-\infty}^{T(y_1)} h_{Y_1 Y_2}(y_1, y_2)\,dy_2} = F_X(x). \tag{3.4.11}$$

Differentiating $F_X(x)$ with respect to $x$, we obtain

$$f_X(x) = f_{Y_1}(x \mid Y_2 \le T(Y_1)) = \frac{\displaystyle\int_{-\infty}^{T(x)} h_{Y_1 Y_2}(x, y_2)\,dy_2}{\displaystyle\int_{-\infty}^{\infty} dy_1 \int_{-\infty}^{T(y_1)} h_{Y_1 Y_2}(y_1, y_2)\,dy_2}. \tag{3.4.12}$$

Theoretically, (3.4.12) offers an infinite number of possibilities for choosing
$h$ and $T$ so as to define a proper $f_X(x)$. But, practically, this formula
has no direct application for generating r.v.'s from $f_X(x)$.

Let $Y_1$ and $Y_2$ be independent. Consider some particular cases, as
follows.

CASE 1 Let $h_{Y_1 Y_2}(y_1, y_2) = h_{Y_1}(y_1)\,h_{Y_2}(y_2)$. Then

$$f_X(x) = \frac{h_{Y_1}(x) \displaystyle\int_{-\infty}^{T(x)} h_{Y_2}(y_2)\,dy_2}{\displaystyle\int h_{Y_1}(y_1)\,dy_1 \int_{-\infty}^{T(y_1)} h_{Y_2}(y_2)\,dy_2} = \frac{h_{Y_1}(x)\,H_{Y_2}(T(x))}{\displaystyle\int h_{Y_1}(y_1)\,H_{Y_2}(T(y_1))\,dy_1}, \tag{3.4.13}$$

where $H_Y(y)$ is the c.d.f. of $Y$.

The last formula can be written as

$$f_X(x) = C\,h_{Y_1}(x)\,H_{Y_2}(T(x)), \tag{3.4.14}$$

where

$$C^{-1} = \int h_{Y_1}(y_1)\,H_{Y_2}(T(y_1))\,dy_1 \tag{3.4.15}$$

is the efficiency of the method. Thus if $Y_1$ and $Y_2$ are independent and if
$f_X(x)$ can be represented as (3.4.14), we have

$$f_{Y_1}(x \mid Y_2 \le T(Y_1)) = f_X(x).$$

We can see that (3.4.14) is similar to (3.4.1). When $g(x) = H_{Y_2}(T(x))$
both (3.4.1) and (3.4.14) coincide. In the particular case when $T(x) = x$ we
obtain

$$f_X(x) = C\,h_{Y_1}(x)\,H_{Y_2}(x). \tag{3.4.16}$$

Algorithm AR-3 describes the acceptance-rejection method for case 1.

Algorithm AR-3

1 Generate $Y_1$ from $h_{Y_1}(y)$.

2 Generate $Y_2$ from $h_{Y_2}(y)$.

3 If $Y_2 \le T(Y_1)$, deliver $Y_1$.

4 Go to step 1.


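Example 6 Generate a random variate from beta distribution

$$f_X(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,x^{\alpha-1}(1-x)^{\beta-1}, \quad 0 \le x \le 1,\ \alpha > 0,\ \beta > 0. \tag{3.4.17}$$

Let us use (3.4.16), assuming

$$h_{Y_1}(x) = \beta(1-x)^{\beta-1}, \quad 0 \le x \le 1 \tag{3.4.18}$$

$$H_{Y_2}(x) = x^{\alpha-1}, \quad 0 \le x \le 1 \tag{3.4.19}$$

$$C = \frac{1}{\beta}\,\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}. \tag{3.4.20}$$

By the inverse transform method we have

$$Y_1 = 1 - U_1^{1/\beta}, \quad Y_2 = U_2^{1/(\alpha-1)}, \tag{3.4.21}$$

and Algorithm AR-3 is as follows:

1 Generate $U_1$ and $U_2$ from $\mathcal{U}(0, 1)$.

2 $Y_1 \leftarrow 1 - U_1^{1/\beta}$.

3 $Y_2 \leftarrow U_2^{1/(\alpha-1)}$.

4 If $Y_2 \le Y_1$, deliver $Y_1$.

5 Go to step 1.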
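The five steps of Example 6 can be sketched in Python as follows. This is only an illustrative sketch: the function name `beta_ar3` and the seeded generator are assumptions made here, and the code assumes $\alpha > 1$ so that $x^{\alpha-1}$ in (3.4.19) is a proper c.d.f. on $[0, 1]$.

```python
import random

def beta_ar3(alpha, beta, rng):
    """Beta(alpha, beta) variate by Algorithm AR-3 with T(x) = x (sketch, alpha > 1)."""
    if alpha <= 1.0 or beta <= 0.0:
        raise ValueError("this sketch assumes alpha > 1 and beta > 0")
    while True:
        u1, u2 = rng.random(), rng.random()
        y1 = 1.0 - u1 ** (1.0 / beta)        # Y1 from h(x) = beta (1 - x)^(beta - 1)
        y2 = u2 ** (1.0 / (alpha - 1.0))     # Y2 with c.d.f. H(x) = x^(alpha - 1)
        if y2 <= y1:                         # accept when Y2 <= T(Y1) = Y1
            return y1

rng = random.Random(1)
beta_sample = [beta_ar3(2.0, 3.0, rng) for _ in range(20000)]
beta_mean = sum(beta_sample) / len(beta_sample)   # theoretical mean is 2 / (2 + 3)
```

The sample mean can be compared with $\alpha/(\alpha+\beta)$ as a quick sanity check.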
Example 7 Consider again the problem of generating a random variate from beta distribution (3.4.17). Let us make use of (3.4.14), assuming

$$h_{Y_1}(x) = \alpha x^{\alpha-1}, \quad 0 \le x \le 1 \tag{3.4.22}$$

$$H_{Y_2}(T(x)) = (1-x)^{\beta-1}, \quad 0 \le x \le 1 \tag{3.4.23}$$

$$C = \frac{1}{\alpha}\,\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}. \tag{3.4.24}$$

Here

$$T(x) = 1 - x. \tag{3.4.25}$$

By the inverse transform method $Y_1 = U_1^{1/\alpha}$, $Y_2 = U_2^{1/(\beta-1)}$, and Algorithm AR-3 is as follows:

1 Generate $U_1$ and $U_2$ from $\mathcal{U}(0, 1)$.

2 $Y_1 \leftarrow U_1^{1/\alpha}$.

3 $Y_2 \leftarrow U_2^{1/(\beta-1)}$.

4 If $Y_2 \le 1 - Y_1$, deliver $Y_1$.

5 Go to step 1.


Remark If $f_X(x)$ can be represented as $f_X(x) = C h_{Y_1}(x)[1 - H_{Y_2}(T(x))]$, then it is easy to see that Algorithm AR-3 can be written as follows.

Algorithm AR-3'

1 Generate $Y_1$ from $h_{Y_1}(y)$.

2 Generate $Y_2$ from $h_{Y_2}(y)$.

3 If $Y_2 \ge T(Y_1)$, deliver $Y_1$.

4 Go to step 1.

Case 2 Let $0 \le T(x) \le M$ and let $Y_2$ be from $\mathcal{U}(0, M)$, that is,

$$h_{Y_2}(y_2) = \begin{cases} \dfrac{1}{M}, & 0 \le y_2 \le M \\ 0, & \text{otherwise.} \end{cases} \tag{3.4.26}$$

Then it follows directly from (3.4.13) that

$$f_X(x) = \frac{h_{Y_1}(x)T(x)}{\int h_{Y_1}(y_1)T(y_1)\,dy_1} = C_1 h_{Y_1}(x)T(x), \tag{3.4.27}$$

where

$$C_1^{-1} = \int h_{Y_1}(y_1)T(y_1)\,dy_1. \tag{3.4.28}$$

The efficiency of the method is

$$C^{-1} = P(Y_2 \le T(Y_1)) = \int h_{Y_1}(y_1)\int_0^{T(y_1)} h_{Y_2}(y_2)\,dy_2\,dy_1 = \int h_{Y_1}(y_1)\frac{T(y_1)}{M}\,dy_1 = \frac{1}{C_1 M}. \tag{3.4.29}$$

Substituting $C_1 = C/M$ in (3.4.27) and denoting $g(x) = T(x)/M$, we obtain

$$f_X(x) = C h_{Y_1}(x) g(x), \tag{3.4.30}$$

which is exactly (3.4.1). So case 2 corresponds to Algorithm AR-1.

Example 8 Consider again the problem of generating a random variate from beta distribution (3.4.17), representing $f_X(x)$ as in (3.4.30), that is, applying Algorithm AR-1 and taking into account that

$$g(x) = H_{Y_2}(x) = x^{\alpha-1} \tag{3.4.31}$$

and

$$g(x) = H_{Y_2}(T(x)) = (1-x)^{\beta-1}, \tag{3.4.32}$$

respectively, for examples 6 and 7. Algorithm AR-1 for example 6 (see (3.4.17) through (3.4.21)) can be written as:

1 Generate $U_1$ and $U_2$ from $\mathcal{U}(0, 1)$.

2 $Y \leftarrow 1 - U_1^{1/\beta}$.

3 If $U_2 \le Y^{\alpha-1}$, deliver $Y$.

4 Go to step 1.

Similarly, for example 7 Algorithm AR-1 can be written as:

1 Generate $U_1$ and $U_2$ from $\mathcal{U}(0, 1)$.

2 $Y \leftarrow U_1^{1/\alpha}$.

3 If $U_2 \le (1 - Y)^{\beta-1}$, deliver $Y$.

4 Go to step 1.

Case 3 Let $a \le x \le b$, $0 \le T(x) \le M$, and let $Y_1$ and $Y_2$ be independent r.v.'s distributed $\mathcal{U}(a, b)$ and $\mathcal{U}(0, M)$, respectively. We immediately obtain from (3.4.14)

$$f_X(x) = T(x).$$

Rewriting $f_X(x)$ in the standard way (3.4.1)

$$f_X(x) = C h(x) g(x), \tag{3.4.33}$$

we have

$$C = M(b - a) \tag{3.4.34}$$

$$h(x) = \begin{cases} \dfrac{1}{b-a}, & a \le x \le b \\ 0, & \text{otherwise} \end{cases} \tag{3.4.35}$$

$$0 \le g(x) = \frac{T(x)}{M} \le 1. \tag{3.4.36}$$

Therefore case 3 corresponds to Algorithm AR-2.

We can easily see that Algorithm AR-3 generalizes both Algorithms AR-1 and AR-2 in the sense that, when $h_{Y_2}(x)$ is a uniform density, we obtain Algorithm AR-1, and when both $h_{Y_1}(x)$ and $h_{Y_2}(x)$ are uniform densities, we obtain Algorithm AR-2. But (3.4.1) generalizes (3.4.14) in the sense that the c.d.f. $H_{Y_2}(T(x))$ is a particular case of $g(x)$, $0 \le g(x) \le 1$.

Résumé: Formula (3.4.1) generalizes (3.4.14). In the particular case when $g(x)$ can be represented as a c.d.f. $H_{Y_2}(T(x))$ from which a random variate $Y_2$ can be easily generated, Algorithm AR-3 generalizes Algorithm AR-1 and as a rule saves computation (CPU) time.

Formula (3.4.14) can be extended easily for the multivariate case. 

Theorem 3.4.3 Let $Y = (Y_1, \ldots, Y_n)$ be a random vector with p.d.f. $h_Y(x)$, $x = (x_1, \ldots, x_n)$, and let $W$ be a random variable with p.d.f. $h_W(w)$, $w \in [0, M]$. Let $T(x)$ be an arbitrary continuous function such that $\sup_x T(x) = M$. Then

$$f_{Y_1,\ldots,Y_n}(x_1, \ldots, x_n \mid W \le T(Y)) = C h_Y(x) H_W(T(x)), \tag{3.4.37}$$

where

$$C^{-1} = \int h_Y(y) H_W(T(y))\,dy, \quad y = (y_1, \ldots, y_n). \tag{3.4.38}$$

The proof of this theorem is left for the reader. 

3.4.4 Forsythe's Method 

Forsythe’s method is a rejection technique for sampling from a continu- 
ous distribution. The original idea is attributed to von Neumann [34]. 
Forsythe [15] described the method explicitly. Other descriptions are given 
by Ahrens and Dieter [2] and Fishman [12] with an application to different 
distributions. Our nomenclature follows that of Forsythe. 
Suppose we wish to generate a random variable $X$ from any p.d.f. of the form

$$f_X(x) = C e^{-h(x)}, \quad x \ge 0, \tag{3.4.39}$$

where

$$C^{-1} = \int_0^{\infty} e^{-h(x)}\,dx$$

and $h(x)$ is an increasing function of $x$ over the range $[0, \infty)$. In the first stage of the method an interval is selected for $x$, and in the second stage the value of $x$ is determined within the interval by a rejection.

For each $k = 1, 2, \ldots, K$ ($K$ is defined below) pick $g_k$ as large as possible subject to the constraints

$$h(g_k) - h(g_{k-1}) \le 1, \quad g_0 = 0. \tag{3.4.40}$$

Next compute

$$r_k = \int_0^{g_k} f_X(x)\,dx, \quad k = 1, \ldots, K. \tag{3.4.41}$$

Here the number of intervals, $K$, is chosen as the least index such that $r_K$ exceeds the largest number less than one that can be represented in a computer. ($K$ may be chosen smaller if we set $r_K = 1$, and if we are willing to truncate the generated variable by reducing any value above $g_K$ to the interval $(g_{K-1}, g_K)$.) Finally, compute

$$d_k = g_k - g_{k-1} \tag{3.4.42}$$

and the function

$$G_k(x) = h(g_{k-1} + x) - h(g_{k-1}) \le h(g_k) - h(g_{k-1}) \le 1, \quad 0 \le x \le d_k. \tag{3.4.43}$$

Now we present the algorithm. Steps 1 to 3 determine which interval the variable $y$ will belong to. Steps 4 to 8 determine the value of $y$ within that interval.


Algorithm F-1

1 Set $k \leftarrow 1$. Generate $U$ from $\mathcal{U}(0, 1)$.

2 If $U \le r_k$, go to step 4 (the $k$th interval is selected).

3 If $U > r_k$, set $k \leftarrow k + 1$ and go back to step 2.

4 Generate another uniform deviate $U$ and set $X \leftarrow U d_k$.

5 Set $t \leftarrow G_k(X)$.

6 Generate $U_1, U_2, \ldots, U_N$, where $N$ is such that $t \ge U_1 \ge U_2 \ge \cdots \ge U_{N-1}$ but $U_{N-1} < U_N$ ($N = 1$ if $t < U_1$).

7 If $N$ is even, reject $X$ and return to step 4.

8 If $N$ is odd, accept $X$ and deliver $Y = g_{k-1} + X$.

The proof of the method is given in Forsythe [15] (see also Fishman [12, p. 400] and Ahrens and Dieter [2]).

Example 1 Exponential Distribution For $h(x) = x$, $f_X(x)$ is a standard exponential distribution and we have $g_k = k$, $d_k = 1$, and $r_k = 1 - e^{-k}$ for all $k$.

Example 2 Normal Distribution For $h(x) = x^2/2$, $f_X(x)$ corresponds to the positive half of the normal distribution and we have $g_0 = 0$, $g_1 = 1$, $g_k = (2k - 1)^{1/2}$, $k \ge 2$; $d_1 = 1$, $d_2 = 3^{1/2} - 1, \ldots, d_k = (2k - 1)^{1/2} - (2k - 3)^{1/2}$, $k \ge 2$. Also

$$G_k(x) = \frac{x^2}{2} + g_{k-1}x, \quad k \ge 1.$$

The advantage of this method is that it provides a rejection technique for densities of the form (3.4.39) without the need for exponentiation. If $G_k(x)$ is easier to calculate than $e^{-h(x)}$, as it is for many members of the exponential family, the method can yield fast algorithms.

An important feature of the method is that it does not specify a unique algorithm, but rather a family of algorithms, subject to (3.4.40) being satisfied. The interval widths $d_k$ can be chosen at will.

A disadvantage of the method is that it requires tables of the constants $g_k$, $d_k$, and $r_k$.
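As an illustration of Forsythe's method for the half-normal case $h(x) = x^2/2$, the following Python sketch builds the tables $g_k$ and $r_k$ and then applies the comparison chain. The function name, the choice $K = 12$, and the use of `math.erf` to tabulate $r_k$ are assumptions made for this sketch; it also assumes that a rejection resamples inside the already selected interval and that the delivered value is $g_{k-1} + X$.

```python
import math
import random

def forsythe_half_normal(rng, K=12):
    """One draw from the positive half of N(0, 1) by Forsythe's method (sketch)."""
    # Interval end points: g_0 = 0, g_1 = 1, g_k = sqrt(2k - 1) for k >= 2.
    g = [0.0, 1.0] + [math.sqrt(2 * k - 1) for k in range(2, K + 1)]
    # r_k = P(X <= g_k) for the half-normal; r_K is set to 1 (truncation at g_K).
    r = [math.erf(gk / math.sqrt(2.0)) for gk in g]
    r[K] = 1.0
    # Interval selection.
    u = rng.random()
    k = 1
    while u > r[k]:
        k += 1
    d = g[k] - g[k - 1]
    while True:
        # Candidate offset inside the interval and t = G_k(X), which lies in [0, 1].
        x = rng.random() * d
        t = 0.5 * x * x + g[k - 1] * x
        # N is the first index at which the chain t >= U_1 >= U_2 >= ... breaks.
        n, prev, u = 1, t, rng.random()
        while u <= prev:
            prev, u, n = u, rng.random(), n + 1
        # Odd N accepts; even N retries inside the same interval.
        if n % 2 == 1:
            return g[k - 1] + x

rng = random.Random(2)
half_normal = [forsythe_half_normal(rng) for _ in range(20000)]
half_normal_mean = sum(half_normal) / len(half_normal)   # theory: sqrt(2 / pi)
```

No exponential is evaluated inside the sampling loop; the only transcendental calls are in the table setup, which a production version would compute once.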

3.5 SIMULATION OF RANDOM VECTORS
3.5.1 Inverse Transform Method

Let $X = (X_1, \ldots, X_n)$ be a random vector to be generated from the given c.d.f. $F_X(x)$. We distinguish the following two cases.

Case 1 The random variables $X_1, \ldots, X_n$ are independent. In this case the joint p.d.f. is

$$f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_i(x_i), \tag{3.5.1}$$

where $f_i(x_i)$ is the marginal p.d.f. of the random variable $X_i$. It is easy to see that, in order to generate the random vector $X = (X_1, \ldots, X_n)$ from c.d.f. $F_X(x)$, we can apply the inverse-transform method

$$X_i = F_i^{-1}(U_i), \quad i = 1, \ldots, n, \tag{3.5.2}$$

to each variable separately.
Example 1 Let $X_i$ be independent r.v.'s with the p.d.f.

$$f_i(x_i) = \begin{cases} \dfrac{1}{b_i - a_i}, & a_i \le x_i \le b_i,\ i = 1, \ldots, n \\ 0, & \text{otherwise.} \end{cases}$$

To generate the random vector $X = (X_1, \ldots, X_n)$ with the joint p.d.f.

$$f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = \begin{cases} \dfrac{1}{\prod_{i=1}^{n}(b_i - a_i)}, & (x_1, \ldots, x_n) \in D \\ 0, & \text{otherwise,} \end{cases}$$

where $D = \{(x_1, \ldots, x_n) : a_i \le x_i \le b_i,\ i = 1, \ldots, n\}$, we apply the inverse transform formula (3.5.2) and get $X_i = a_i + (b_i - a_i)U_i$, $i = 1, \ldots, n$.


Case 2 The random variables are dependent. In this case the joint p.d.f. is

$$f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = f_1(x_1) f_2(x_2 \mid x_1) \cdots f_n(x_n \mid x_1, \ldots, x_{n-1}), \tag{3.5.3}$$

where $f_1(x_1)$ is the marginal p.d.f. of $X_1$ and $f_k(x_k \mid x_1, \ldots, x_{k-1})$ is the conditional p.d.f. of $X_k$ given $X_1 = x_1, X_2 = x_2, \ldots, X_{k-1} = x_{k-1}$.

Theorem 3.5.1 Let $U_1, \ldots, U_n$ be independent uniformly distributed random variates from $\mathcal{U}(0, 1)$. Then the vector $X = (X_1, \ldots, X_n)$, which is obtained from the solution of the following system of equations

$$\begin{aligned} F_1(X_1) &= U_1 \\ F_2(X_2 \mid X_1) &= U_2 \\ &\ \ \vdots \\ F_n(X_n \mid X_1, \ldots, X_{n-1}) &= U_n, \end{aligned} \tag{3.5.4}$$

is distributed according to $F_X(x)$. The proof of this theorem is similar to the proof of (3.2.2) and is left for the reader.

The procedure for generating random variates from (3.5.3) contains only two steps.

1 Generate $n$ independent uniformly distributed variates from $\mathcal{U}(0, 1)$.

2 Solve the system of equations (3.5.4) with respect to $X = (X_1, \ldots, X_n)$.

There are $n!$ ordered combinations (possibilities) to represent the variables $X_1, \ldots, X_n$ in vector $X$, and therefore $n!$ possibilities to generate $X$ while solving (3.5.4). Thus for $n = 2$ ($n! = 2$) we can write $f_{X_1,X_2}(x_1, x_2)$ in two different ways:

1 $f_{X_1,X_2}(x_1, x_2) = f_1(x_1) f_2(x_2 \mid x_1)$ (3.5.5)

2 $f_{X_1,X_2}(x_1, x_2) = f_2(x_2) f_1(x_1 \mid x_2)$. (3.5.6)

The efficiency of simulation will generally depend on the order in which the random variates $X_i$, $i = 1, \ldots, n$, are taken while forming the random vector $X$.

The following example, which is taken from Sobol [29], uses both formulas (3.5.5) and (3.5.6) for generating a two-variate random vector $X = (X_1, X_2)$ and shows the difference in their efficiency.


Example 1

$$f_{X_1,X_2}(x_1, x_2) = \begin{cases} 6x_1, & \text{if } x_1 + x_2 < 1,\ x_1 > 0,\ x_2 > 0 \\ 0, & \text{otherwise.} \end{cases}$$

Case 1

$$f_{X_1,X_2}(x_1, x_2) = f_1(x_1) f_2(x_2 \mid x_1).$$

The marginal p.d.f. of the r.v. $X_1$ is

$$f_1(x_1) = \int_0^{1-x_1} f_{X_1,X_2}(x_1, x_2)\,dx_2 = 6x_1(1 - x_1), \quad 0 < x_1 < 1.$$

The conditional p.d.f. of the r.v. $X_2$, given $X_1 = x_1$, is

$$f_2(x_2 \mid x_1) = \frac{f(x_1, x_2)}{f_1(x_1)} = \frac{1}{1 - x_1}, \quad 0 < x_2 < 1 - x_1.$$

The corresponding marginal and conditional distribution functions are, respectively,

$$F_1(x_1) = \int_0^{x_1} f_1(x_1)\,dx_1 = 3x_1^2 - 2x_1^3, \quad 0 < x_1 \le 1$$

$$F_2(x_2 \mid x_1) = \int_0^{x_2} f_2(x_2 \mid x_1)\,dx_2 = x_2(1 - x_1)^{-1}, \quad 0 < x_2 < 1 - x_1,$$

and the system (3.5.4) is

$$\begin{cases} 3X_1^2 - 2X_1^3 = U_1 \\ X_2(1 - X_1)^{-1} = U_2. \end{cases}$$

Case 2

$$f_{X_1,X_2}(x_1, x_2) = f_2(x_2) f_1(x_1 \mid x_2).$$

The marginal and conditional p.d.f.'s are, respectively,

$$f_2(x_2) = \int_0^{1-x_2} f(x_1, x_2)\,dx_1 = 3(1 - x_2)^2, \quad 0 < x_2 < 1$$

$$f_1(x_1 \mid x_2) = \frac{f(x_1, x_2)}{f_2(x_2)} = 2x_1(1 - x_2)^{-2}, \quad 0 < x_1 < 1 - x_2.$$

The corresponding marginal and conditional distribution functions are

$$F_2(x_2) = \int_0^{x_2} f_2(x_2)\,dx_2 = 1 - (1 - x_2)^3, \quad 0 < x_2 < 1$$

$$F_1(x_1 \mid x_2) = \int_0^{x_1} f_1(x_1 \mid x_2)\,dx_1 = x_1^2(1 - x_2)^{-2}, \quad 0 < x_1 < 1 - x_2,$$

and the system (3.5.4) is

$$\begin{cases} 1 - (1 - X_2)^3 = U_1 \\ X_1^2(1 - X_2)^{-2} = U_2. \end{cases}$$

Inasmuch as $1 - U$ is distributed in the same way as $U$, the last system can be written

$$\begin{cases} (1 - X_2)^3 = U_1 \\ X_1^2 = U_2(1 - X_2)^2. \end{cases}$$

Comparing both cases, we can see that the first system is rather difficult to solve (we would have to solve cubic and quadratic equations, respectively), while the second system has a trivial solution

$$X_2 = 1 - U_1^{1/3}$$

$$X_1 = U_2^{1/2} U_1^{1/3}.$$

Unfortunately, there is no way to find a priori the optimal order of 
representing the variates in the vector to minimize the CPU time. 
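The closed-form solution of the second ordering can be sketched in Python as below. The function name `sample_pair` and the seeded generator are illustrative assumptions; the sample means are compared with the values $E[X_1] = 1/2$ and $E[X_2] = 1/4$ implied by the marginals $6x_1(1-x_1)$ and $3(1-x_2)^2$.

```python
import random

def sample_pair(rng):
    """One draw (X1, X2) from f(x1, x2) = 6 x1 on x1 + x2 < 1, via case 2."""
    u1, u2 = rng.random(), rng.random()
    x2 = 1.0 - u1 ** (1.0 / 3.0)            # solves (1 - X2)^3 = U1
    x1 = u2 ** 0.5 * u1 ** (1.0 / 3.0)      # solves X1^2 = U2 (1 - X2)^2
    return x1, x2

rng = random.Random(3)
pairs = [sample_pair(rng) for _ in range(20000)]
mean_x1 = sum(p[0] for p in pairs) / len(pairs)
mean_x2 = sum(p[1] for p in pairs) / len(pairs)
```

Each draw costs two uniforms, one cube root, and one square root, whereas the first ordering would need a numerical root of a cubic for every draw.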

Remark For independent r.v.’s the efficiency of simulation does not 
depend on the order in which the r.v.'s are taken in forming the random 
vector X. 

An alternative method for generating random vectors is the acceptance- 
rejection method based on Theorem 3.4.3. 

3.5.2 Multivariate Transformation Method 

This method can sometimes be useful for generating both random 
variables and random vectors. 
Suppose that we are given the joint p.d.f. $f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)$ of the $n$-dimensional continuous random variable $(X_1, \ldots, X_n)$. Let

$$\mathcal{X} = \{(x_1, \ldots, x_n) : f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) > 0\}. \tag{3.5.7}$$

Assume that the joint density of the random variables $Y_1 = g_1(X_1, \ldots, X_n), \ldots, Y_k = g_k(X_1, \ldots, X_n)$ is desired, where $k$ is an integer satisfying $1 \le k \le n$. If $k < n$, we introduce additional new random variables $Y_{k+1} = g_{k+1}(X_1, \ldots, X_n), \ldots, Y_n = g_n(X_1, \ldots, X_n)$ for judiciously selected functions $g_{k+1}, \ldots, g_n$; then we find the joint distribution of $Y_1, \ldots, Y_n$; finally, we find the desired marginal distribution of $Y_1, \ldots, Y_k$ from the joint distribution of $Y_1, \ldots, Y_n$. This use of possibly introducing additional random variables makes the transformation $y_1 = g_1(x_1, \ldots, x_n), \ldots, y_n = g_n(x_1, \ldots, x_n)$ a transformation from an $n$-dimensional space to an $n$-dimensional space. Henceforth we assume that we are seeking the joint distribution of $Y_1 = g_1(X_1, \ldots, X_n), \ldots, Y_n = g_n(X_1, \ldots, X_n)$ (rather than the joint distribution of $Y_1, \ldots, Y_k$) when we have given the joint probability density of $X_1, \ldots, X_n$.

We state our results for $n = 2$. The generalization for $n > 2$ is straightforward. Let $f_{X_1,X_2}(x_1, x_2)$ be given. Set $\mathcal{X} = \{(x_1, x_2) : f_{X_1,X_2}(x_1, x_2) > 0\}$. We want to find the joint distribution of $Y_1 = g_1(X_1, X_2)$ and $Y_2 = g_2(X_1, X_2)$ for known functions $g_1(x_1, x_2)$ and $g_2(x_1, x_2)$. Now suppose that $y_1 = g_1(x_1, x_2)$ and $y_2 = g_2(x_1, x_2)$ defines a one-to-one transformation that maps $\mathcal{X}$ onto, say, $D$. Then $x_1$ and $x_2$ can be expressed in terms of $y_1$ and $y_2$; so we can write, say, $x_1 = \varphi_1(y_1, y_2)$ and $x_2 = \varphi_2(y_1, y_2)$. Note that $\mathcal{X}$ is a subset of the $x_1 x_2$ plane and $D$ is a subset of the $y_1 y_2$ plane consisting of points $(y_1, y_2)$ for which there exists an $(x_1, x_2) \in \mathcal{X}$ such that $(y_1, y_2) = [g_1(x_1, x_2), g_2(x_1, x_2)]$. The determinant

$$\begin{vmatrix} \dfrac{\partial x_1}{\partial y_1} & \dfrac{\partial x_1}{\partial y_2} \\[2mm] \dfrac{\partial x_2}{\partial y_1} & \dfrac{\partial x_2}{\partial y_2} \end{vmatrix}$$

is called the Jacobian of the transformation and is denoted by $J$. The above discussion permits us to state Theorem 3.5.2.

Theorem 3.5.2 Let $X_1$ and $X_2$ be jointly continuous random variables with density function $f_{X_1,X_2}(x_1, x_2)$. Set $\mathcal{X} = \{(x_1, x_2) : f_{X_1,X_2}(x_1, x_2) > 0\}$. Assume that:

1 $y_1 = g_1(x_1, x_2)$ and $y_2 = g_2(x_1, x_2)$ defines a one-to-one transformation of $\mathcal{X}$ onto $D$.

2 The first partial derivatives of $x_1 = \varphi_1(y_1, y_2)$ and $x_2 = \varphi_2(y_1, y_2)$ are continuous over $D$.

3 The Jacobian of the transformation is nonzero for $(y_1, y_2) \in D$.

Then the joint density of $Y_1 = g_1(X_1, X_2)$ and $Y_2 = g_2(X_1, X_2)$ is given by

$$f_{Y_1,Y_2}(y_1, y_2) = |J|\, f_{X_1,X_2}(\varphi_1(y_1, y_2), \varphi_2(y_1, y_2))\, I_D(y_1, y_2), \tag{3.5.8}$$

where

$$I_D(y_1, y_2) = I_{\mathcal{X}}(\varphi_1(y_1, y_2), \varphi_2(y_1, y_2)) = \begin{cases} 1, & \text{if } (y_1, y_2) \in D \\ 0, & \text{otherwise.} \end{cases}$$

The proof is essentially the derivation of the formulas for transforming variables in double integrals. For proof, the reader is referred to Neuts [25].


For the single variate case the transformation formula (3.5.8) becomes

$$f_Y(y) = f_X(g^{-1}(y)) \left| \frac{d\,g^{-1}(y)}{dy} \right| I_{\mathcal{X}}(g^{-1}(y)) = f_X(x) \left| \frac{dx}{dy} \right| I_{\mathcal{X}}(x). \tag{3.5.9}$$

Here $f_X(x)$ is the given p.d.f., $f_Y(y)$ is the desired p.d.f., $I_{\mathcal{X}}$ is the indicator of the interval of $x$, and $Y = g(X)$. We can see that (3.5.9) is a particular case of (3.5.8).


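Example 1 Let $Z_1$ and $Z_2$ be two independent standard normal random variables. Let $Y_1 = Z_1 + Z_2$ and $Y_2 = Z_1/Z_2$. Then

$$z_1 = \frac{y_1 y_2}{1 + y_2} \quad \text{and} \quad z_2 = \frac{y_1}{1 + y_2},$$

$$J = \begin{vmatrix} \dfrac{y_2}{1 + y_2} & \dfrac{y_1}{(1 + y_2)^2} \\[2mm] \dfrac{1}{1 + y_2} & -\dfrac{y_1}{(1 + y_2)^2} \end{vmatrix} = -\frac{y_1 (y_2 + 1)}{(1 + y_2)^3} = -\frac{y_1}{(1 + y_2)^2},$$

$$f_{Y_1,Y_2}(y_1, y_2) = \frac{|y_1|}{(1 + y_2)^2}\,\frac{1}{2\pi} \exp\left\{ -\frac{1}{2}\left[ \frac{(y_1 y_2)^2}{(1 + y_2)^2} + \frac{y_1^2}{(1 + y_2)^2} \right] \right\} = \frac{1}{2\pi}\,\frac{|y_1|}{(1 + y_2)^2} \exp\left[ -\frac{1}{2}\,\frac{(1 + y_2^2)\,y_1^2}{(1 + y_2)^2} \right].$$

To find the marginal distribution of, say, $Y_2$, we must integrate out $y_1$, that is,

$$f_{Y_2}(y_2) = \int_{-\infty}^{\infty} f_{Y_1,Y_2}(y_1, y_2)\,dy_1 = \frac{1}{2\pi}\,\frac{1}{(1 + y_2)^2} \int_{-\infty}^{\infty} |y_1| \exp\left[ -\frac{1}{2}\,\frac{(1 + y_2^2)\,y_1^2}{(1 + y_2)^2} \right] dy_1.$$

Let

$$u = \frac{1}{2}\,\frac{(1 + y_2^2)}{(1 + y_2)^2}\,y_1^2;$$

then

$$du = \frac{(1 + y_2^2)}{(1 + y_2)^2}\,y_1\,dy_1,$$

and so

$$f_{Y_2}(y_2) = \frac{1}{2\pi}\,\frac{1}{(1 + y_2)^2}\,\frac{(1 + y_2)^2}{1 + y_2^2}\; 2\int_0^{\infty} e^{-u}\,du = \frac{1}{\pi}\,\frac{1}{1 + y_2^2},$$

a Cauchy density. In other words, the ratio of two independent standard normal random variables has a Cauchy distribution.

To generate an r.v. from a Cauchy distribution we generate $Z_1$ and $Z_2$ from $N(0, 1)$ and take their ratio.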
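A minimal Python sketch of this ratio construction follows. The function name `cauchy_ratio` is an assumption, and the standard normals are drawn with the standard library's `random.gauss` rather than with any generator described in this book. Since the Cauchy distribution has no mean, the check uses the sample quartiles, which should lie near $-1$, $0$, and $1$.

```python
import random

def cauchy_ratio(rng):
    """Standard Cauchy variate as the ratio of two independent N(0, 1) variates."""
    while True:
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        if z2 != 0.0:          # guard against a zero denominator
            return z1 / z2

rng = random.Random(4)
cauchy_sorted = sorted(cauchy_ratio(rng) for _ in range(20001))
q1, median, q3 = cauchy_sorted[5000], cauchy_sorted[10000], cauchy_sorted[15000]
```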


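Example 2 Let $X_i$ have a gamma distribution

$$f_{X_i}(x_i) = \begin{cases} \dfrac{x_i^{n_i - 1} e^{-x_i}}{\Gamma(n_i)}, & x_i > 0,\ n_i > 0 \\ 0, & \text{otherwise} \end{cases}$$

with parameters $n_i$ and 1 for $i = 1, 2$, and assume $X_1$ and $X_2$ are independent. Suppose now that the distribution of $Y_1 = X_1/(X_1 + X_2)$ is desired. We have only the one function $y_1 = g_1(x_1, x_2) = x_1/(x_1 + x_2)$, so we have to select the other to use the transformation technique. Since $x_1$ and $x_2$ occur in the exponent of their joint density as their sum, $x_1 + x_2$ is a good choice. Let $y_2 = x_1 + x_2$; then $x_1 = y_1 y_2$, $x_2 = y_2 - y_1 y_2$, and

$$J = \begin{vmatrix} y_2 & y_1 \\ -y_2 & 1 - y_1 \end{vmatrix} = y_2.$$

Hence

$$f_{Y_1,Y_2}(y_1, y_2) = y_2\,\frac{1}{\Gamma(n_1)\Gamma(n_2)}\,(y_1 y_2)^{n_1 - 1}(y_2 - y_1 y_2)^{n_2 - 1} e^{-y_2}\, I_{(0,1)}(y_1)\, I_{(0,\infty)}(y_2)$$

$$= \left[ \frac{\Gamma(n_1 + n_2)}{\Gamma(n_1)\Gamma(n_2)}\, y_1^{n_1 - 1}(1 - y_1)^{n_2 - 1} I_{(0,1)}(y_1) \right] \left[ \frac{1}{\Gamma(n_1 + n_2)}\, y_2^{n_1 + n_2 - 1} e^{-y_2} I_{(0,\infty)}(y_2) \right].$$

It turns out that $Y_1$ and $Y_2$ are independent and that $Y_1$ has a beta distribution with parameters $n_1$ and $n_2$.

Thus to generate a random variate from beta distribution we generate two gamma variates $X_1$ and $X_2$, then calculate $X_1/(X_1 + X_2)$.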
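The beta-from-gamma construction can be sketched as below. The gamma variates are drawn with the Python standard library's `random.gammavariate(shape, scale)`, which stands in here for any of the gamma generators of Section 3.6.2; the function name `beta_from_gammas` is an assumption.

```python
import random

def beta_from_gammas(n1, n2, rng):
    """Beta(n1, n2) variate as X1 / (X1 + X2) with independent gamma X1, X2."""
    x1 = rng.gammavariate(n1, 1.0)   # gamma with shape n1 and scale 1
    x2 = rng.gammavariate(n2, 1.0)   # gamma with shape n2 and scale 1
    return x1 / (x1 + x2)

rng = random.Random(5)
ratio_sample = [beta_from_gammas(2.0, 5.0, rng) for _ in range(20000)]
ratio_mean = sum(ratio_sample) / len(ratio_sample)   # theory: n1 / (n1 + n2) = 2/7
```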

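3.5.3 Multinormal Distribution

A random vector $X = (X_1, \ldots, X_n)$ has a multinormal distribution if the p.d.f. is given by

$$f_X(x) = (2\pi)^{-n/2}\,|\Sigma|^{-1/2} \exp\left[ -\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu) \right] \tag{3.5.10}$$

and denoted by $N(\mu, \Sigma)$.

Here $\mu = (\mu_1, \ldots, \mu_n)$ is the mean vector, $\Sigma$ is the covariance ($n \times n$) matrix

$$\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2n} \\ \vdots & \vdots & & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn} \end{pmatrix}, \tag{3.5.11}$$

which is positive definite and symmetric, $|\Sigma|$ is the determinant of $\Sigma$, and $\Sigma^{-1}$ is the inverse matrix of $\Sigma$.

Inasmuch as $\Sigma$ is positive definite and symmetric, there exists a unique lower triangular matrix

$$C = \begin{pmatrix} c_{11} & 0 & \cdots & 0 \\ c_{21} & c_{22} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ c_{n1} & c_{n2} & \cdots & c_{nn} \end{pmatrix} \tag{3.5.12}$$

such that

$$\Sigma = C C^T. \tag{3.5.13}$$

Then the vector $X$ can be represented as

$$X = CZ + \mu, \tag{3.5.14}$$

where $Z = (Z_1, \ldots, Z_n)$ is a normal vector with zero mean and covariance matrix equal to the identity matrix, that is, all components $Z_i$, $i = 1, \ldots, n$, of $Z$ are distributed according to the standard normal distribution $N(0, 1)$.

In order to obtain $C$ from $\Sigma = CC^T$ the so-called "square root method" can be used, which provides a set of recursive formulas for computation of the elements of $C$.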

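It follows from (3.5.14) that

$$X_1 = c_{11} Z_1 + \mu_1. \tag{3.5.15}$$

Therefore $\operatorname{var} X_1 = \sigma_{11} = c_{11}^2$ and $c_{11} = \sigma_{11}^{1/2}$. Proceeding with (3.5.14) we obtain

$$X_2 = c_{21} Z_1 + c_{22} Z_2 + \mu_2 \tag{3.5.16}$$

and

$$\operatorname{var} X_2 = \sigma_{22} = \operatorname{var}(c_{21} Z_1 + c_{22} Z_2). \tag{3.5.17}$$

From (3.5.15) and (3.5.16)

$$E[(X_1 - \mu_1)(X_2 - \mu_2)] = \sigma_{12} = E[c_{11} Z_1 (c_{21} Z_1 + c_{22} Z_2)]. \tag{3.5.18}$$

From (3.5.17) and (3.5.18)

$$c_{21} = \frac{\sigma_{12}}{c_{11}} = \frac{\sigma_{12}}{\sigma_{11}^{1/2}} \tag{3.5.19}$$

$$c_{22} = \left( \sigma_{22} - \frac{\sigma_{12}^2}{\sigma_{11}} \right)^{1/2}. \tag{3.5.20}$$

Generally, $c_{ij}$ can be found from the following recursive formula:

$$c_{ij} = \frac{\sigma_{ij} - \sum_{k=1}^{j-1} c_{ik} c_{jk}}{\left( \sigma_{jj} - \sum_{k=1}^{j-1} c_{jk}^2 \right)^{1/2}}, \tag{3.5.21}$$

where

$$\sum_{k=1}^{0} c_{ik} c_{jk} = 0, \quad 1 \le j \le i \le n.$$

Algorithm MN-1 describes the necessary steps for generating a multinormal variate.

Algorithm MN-1

1 Compute the elements $c_{ij}$ of $C$ from the recursive formula (3.5.21).

2 Generate $Z = (Z_1, \ldots, Z_n)$ from $N(0, 1)$.

3 $X \leftarrow CZ + \mu$.

4 Deliver $X$.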
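The square root method and Algorithm MN-1 can be sketched in plain Python as follows. The function names, the list-of-lists matrix layout, and the $2 \times 2$ example covariance matrix are assumptions of this sketch; the recursion used is the standard Cholesky recursion, $c_{ij} = (\sigma_{ij} - \sum_{k<j} c_{ik}c_{jk}) / c_{jj}$ with $c_{jj} = (\sigma_{jj} - \sum_{k<j} c_{jk}^2)^{1/2}$.

```python
import math
import random

def square_root_matrix(sigma):
    """Lower triangular C with sigma = C C^T (sigma symmetric positive definite)."""
    n = len(sigma)
    c = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(j, n):
            s = sum(c[i][k] * c[j][k] for k in range(j))
            if i == j:
                c[j][j] = math.sqrt(sigma[j][j] - s)       # diagonal element
            else:
                c[i][j] = (sigma[i][j] - s) / c[j][j]      # below-diagonal element
    return c

def multinormal(mu, c, rng):
    """One draw X = C Z + mu with Z a vector of independent N(0, 1) variates."""
    z = [rng.gauss(0.0, 1.0) for _ in mu]
    return [mu[i] + sum(c[i][k] * z[k] for k in range(i + 1)) for i in range(len(mu))]

sigma = [[4.0, 2.0], [2.0, 3.0]]
mu = [1.0, -1.0]
c = square_root_matrix(sigma)
rng = random.Random(6)
draws = [multinormal(mu, c, rng) for _ in range(20000)]
m1 = sum(x[0] for x in draws) / len(draws)
m2 = sum(x[1] for x in draws) / len(draws)
cov12 = sum((x[0] - m1) * (x[1] - m2) for x in draws) / len(draws)
```

The matrix $C$ is computed once and reused for every draw, so the per-variate cost is $n$ normal variates and a triangular matrix-vector product.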


3.6 GENERATING FROM CONTINUOUS DISTRIBUTIONS 

This section describes generating procedures for various single-variate 
continuous distributions. 


3.6.1 Exponential Distribution 

An exponential variate $X$ has p.d.f.

$$f_X(x) = \begin{cases} \dfrac{1}{\beta} e^{-x/\beta}, & 0 \le x < \infty,\ \beta > 0 \\ 0, & \text{otherwise,} \end{cases} \tag{3.6.1}$$

denoted by $\exp(\beta)$.

Procedure E-1

By inverse transform method

$$U = F_X(X) = 1 - e^{-X/\beta} \tag{3.6.2}$$

so that

$$X = -\beta \ln(1 - U). \tag{3.6.3}$$

Since $1 - U$ is distributed in the same way as $U$, we have

$$X = -\beta \ln U. \tag{3.6.4}$$

For sampling purposes we may assume $\beta = 1$: if $V$ is sampled from the standard exponential distribution $\exp(1)$, then $X = \beta V$ is from $\exp(\beta)$.

Algorithm E-1

1 Generate $U$ from $\mathcal{U}(0, 1)$.

2 $X \leftarrow -\beta \ln U$.

3 Deliver $X$.

Although this technique seems very simple, the computation of the natural logarithm on a digital computer includes a power series expansion (or some equivalent approximation technique) for each uniform variate generated.

Procedure E-2 

We now prove a proposition that can be useful for generating from exponential distribution $\exp(1)$.

Proposition Let $U_1, \ldots, U_n, U_{n+1}, \ldots, U_{2n-1}$ be independent uniformly distributed random variables, and let $U_{(1)}, \ldots, U_{(n-1)}$ represent the order statistics corresponding to the random sample $U_{n+1}, \ldots, U_{2n-1}$. Assume $U_{(0)} = 0$ and $U_{(n)} = 1$; then the r.v.'s

$$Y_k = (U_{(k-1)} - U_{(k)}) \ln \prod_{i=1}^{n} U_i, \quad k = 1, \ldots, n, \tag{3.6.5}$$

are independent and distributed $\exp(1)$.

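Proof Denote

$$X_k = U_{(k)} - U_{(k-1)}, \quad k = 1, \ldots, n - 1,$$

and

$$X_n = -\ln \prod_{i=1}^{n} U_i.$$

It will be shown in Section 3.6.2 that $X_n$ is from the Erlang distribution, that is,

$$f_{X_n}(x_n) = \frac{x_n^{n-1} e^{-x_n}}{(n - 1)!}, \quad x_n \ge 0. \tag{3.6.6}$$

It is also known (Feller [11]) that the vector $(X_1, \ldots, X_{n-1})$ is distributed

$$f_{X_1,\ldots,X_{n-1}}(x_1, \ldots, x_{n-1}) = (n - 1)! \tag{3.6.7}$$

inside the simplex

$$\sum_{k=1}^{n-1} x_k \le 1, \quad x_k \ge 0, \quad k = 1, \ldots, n - 1.$$

Therefore

$$f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = f_{X_1,\ldots,X_{n-1}}(x_1, \ldots, x_{n-1})\, f_{X_n}(x_n) = x_n^{n-1} e^{-x_n}, \quad \sum_{k=1}^{n-1} x_k \le 1,\ x_k \ge 0,\ k = 1, \ldots, n.$$

We introduce

$$y_k = x_k x_n, \quad k = 1, \ldots, n - 1,$$

and

$$y_n = (1 - x_1 - \cdots - x_{n-1})\, x_n.$$

Then

$$x_n = \sum_{i=1}^{n} y_i, \qquad x_k = \frac{y_k}{\sum_{i=1}^{n} y_i}, \quad k = 1, \ldots, n - 1.$$

The Jacobian of the transformation is

$$\left( \sum_{i=1}^{n} y_i \right)^{-(n-1)}. \tag{3.6.8}$$

Hence

$$f_{Y_1,\ldots,Y_n}(y_1, \ldots, y_n) = \left( \sum_{i=1}^{n} y_i \right)^{n-1} \exp\left( -\sum_{i=1}^{n} y_i \right) \left( \sum_{i=1}^{n} y_i \right)^{-(n-1)} = \prod_{i=1}^{n} e^{-y_i}, \quad y_i \ge 0,\ i = 1, \ldots, n. \tag{3.6.9}$$

Q.E.D.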
Algorithm E-2 describes the necessary steps.

Algorithm E-2

1 Generate $2n - 1$ uniformly distributed random variates $U_1, \ldots, U_n, U_{n+1}, \ldots, U_{2n-1}$.

2 Arrange the variates $U_{n+1}, \ldots, U_{2n-1}$ in order of increasing magnitudes, that is, define them to be the order statistics $U_{(1)}, \ldots, U_{(n-1)}$.

3 $Y_k \leftarrow (U_{(k-1)} - U_{(k)}) \ln \prod_{i=1}^{n} U_i$, $k = 1, \ldots, n$.

4 Deliver $Y_k$, $k = 1, \ldots, n$, as r.v.'s from $\exp(1)$.

Comparing (3.6.5) with the inverse transform method

$$Y_k = -\ln U_k, \quad k = 1, \ldots, n,$$

we find that the advantage of Algorithm E-2 is that it requires only one computation of $\ln \prod_{i=1}^{n} U_i$ for generating $n$ exponential variates simultaneously, whereas the inverse transform method requires $n$ computations of $\ln U_k$, one for each variate $Y_k$, $k = 1, \ldots, n$. The disadvantage of Algorithm E-2 is that it needs $2n - 1$ uniform variates rather than the $n$ uniform variates of the inverse transform method. Additionally, Algorithm E-2 requires the arrangement of the uniform variates $U_{n+1}, \ldots, U_{2n-1}$ to be order statistics $U_{(1)}, \ldots, U_{(n-1)}$ and then calculation of $U_{(k-1)} - U_{(k)}$, which is also time consuming.

Simulating both algorithms we find that Algorithm E-2 is faster than the standard inverse Algorithm E-1 for $n = 3$ to $n = 6$. The optimal $n$ is 4.
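Both exponential procedures can be sketched in Python as below. The function names are assumptions; uniforms are drawn as `1 - rng.random()` so that they lie in $(0, 1]$ and the logarithm is always defined, and `math.prod` (Python 3.8 or later) forms the product so that only one logarithm is taken per batch in the second function.

```python
import math
import random

def exp_inverse(beta, rng):
    """Algorithm E-1: X = -beta ln U."""
    return -beta * math.log(1.0 - rng.random())

def exp_batch(n, rng):
    """Algorithm E-2: n independent exp(1) variates from 2n - 1 uniforms, one logarithm."""
    log_prod = math.log(math.prod(1.0 - rng.random() for _ in range(n)))
    # Order statistics U_(1) <= ... <= U_(n-1), padded with U_(0) = 0 and U_(n) = 1.
    cuts = [0.0] + sorted(rng.random() for _ in range(n - 1)) + [1.0]
    return [(cuts[k - 1] - cuts[k]) * log_prod for k in range(1, n + 1)]

rng = random.Random(7)
e1 = [exp_inverse(2.0, rng) for _ in range(20000)]
e2 = [y for _ in range(5000) for y in exp_batch(4, rng)]
e1_mean = sum(e1) / len(e1)    # theory: beta = 2
e2_mean = sum(e2) / len(e2)    # theory: 1
```

For large $n$ the product of uniforms can underflow in floating point, which is one more practical reason to keep $n$ small, in line with the optimal $n = 4$ quoted above.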

There are many alternative procedures (Ahrens and Dieter [1], Marsaglia [19]) for generating from $\exp(\beta)$ without the benefit of a logarithmic transformation, procedures that are based on the composition method, acceptance-rejection method, and Forsythe method [15] (see also example 1, Section 3.4.4). The reader is also referred to Fishman's monograph [12]. Before leaving the exponential distribution we want to introduce von Neumann's ingenious method [34] for generating from $\exp(1)$, a method that was later extended by Forsythe [15] and Ahrens and Dieter [2] for generating various distributions.

Let [X t : / — 0, . . . } be a sequence of i.i.d. r.v.’s from the standard 
triangular distribution 

0 < x < y 
y < x < 1 

and define an r.v. N, taking positive integer values through (A',} by the 


/*(*) = 


2x 

2 ( 1 -*) 

1-Y 
inequalities

2 N — l j V 

AT, < A'o, 2 A} £ * 0 2 Xj S AT 0 , 2 > * 0 . 

7-1 7-1 >•! 

We accept the sequence $\{X_i\}$ if $N$ is odd; otherwise we reject it and repeat the process until $N$ turns out odd. Let $T$ be the number of sequences rejected before an odd $N$ appears ($T = 0, 1, \ldots$) and let $X_0$ be the value of the first variable in the accepted sequence; then $Y = T + X_0$ is from $\exp(1)$. It is shown in ref. 34 that generation of one exponential variate in such a
way requires on the average (1 +e)(l - random numbers. 

3.6.2 Gamma Distribution 

A random variable $X$ has a gamma distribution if its p.d.f. is defined as

$$f_X(x) = \begin{cases} \dfrac{x^{\alpha-1} e^{-x/\beta}}{\beta^{\alpha}\,\Gamma(\alpha)}, & 0 \le x < \infty,\ \alpha > 0,\ \beta > 0 \\ 0, & \text{otherwise,} \end{cases}$$

and is denoted by $G(\alpha, \beta)$. Note that for $\alpha = 1$, $G(1, \beta)$ is $\exp(\beta)$.

Inasmuch as the c.d.f. does not exist in explicit form for gamma distribution, the inverse transform method cannot be applied. Therefore alternative methods of generating gamma variates must be considered.

Procedure G-1

One of the most important properties of gamma distribution is the reproductive property, which can be successfully used for gamma generation. Let $X_i$, $i = 1, \ldots, n$, be a sequence of independent random variables from $G(\alpha_i, \beta)$. Then $X = \sum_{i=1}^{n} X_i$ is from $G(\alpha, \beta)$, where $\alpha = \sum_{i=1}^{n} \alpha_i$.

If $\alpha$ is an integer, say, $\alpha = m$, a random variate from gamma distribution $G(m, \beta)$ can be obtained by summing $m$ independent exponential random variates $\exp(\beta)$, that is,

$$X = \beta \sum_{i=1}^{m} (-\ln U_i) = -\beta \ln \prod_{i=1}^{m} U_i, \tag{3.6.10}$$

which is called Erlang distribution and denoted $Er(m, \beta)$. Algorithm G-1 describes generating r.v.'s from $Er(m, \beta)$.

Algorithm G-1

1 $X \leftarrow 0$.

2 Generate $V$ from $\exp(1)$.

3 $X \leftarrow X + V$.

4 If $\alpha = 1$, $X \leftarrow \beta X$ and deliver $X$.

5 $\alpha \leftarrow \alpha - 1$.

6 Go to step 2.

It is not difficult to see that the mean computation (CPU) time for generation from Erlang distribution is an increasing linear function of $\alpha$. However, if $\alpha$ is nonintegral, (3.6.10) is not applicable and some difficulties arise while generating gamma variates.
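Algorithm G-1 amounts to summing $m$ standard exponential variates and scaling by $\beta$, as in the following Python sketch. The function name `erlang` is an assumption, and each $\exp(1)$ variate is obtained by the inverse transform $-\ln U$.

```python
import math
import random

def erlang(m, beta, rng):
    """Er(m, beta) variate: beta times the sum of m independent exp(1) variates."""
    x = 0.0
    for _ in range(m):
        x += -math.log(1.0 - rng.random())   # one exp(1) variate, U in (0, 1]
    return beta * x

rng = random.Random(8)
erlang_sample = [erlang(3, 2.0, rng) for _ in range(20000)]
erlang_mean = sum(erlang_sample) / len(erlang_sample)   # theory: m * beta = 6
```

The loop makes the linear growth of the cost in $m$ explicit.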

For some time no exact method was known and approximate techniques 
were used. The most common method was the so-called probability switch 
method [24]. 

Let $m = [\alpha]$ be the integral part of $\alpha$ and let $\delta = \alpha - m$. With probability $\delta$, generate a random variate from $G(m + 1, \beta)$. With probability $1 - \delta$, generate a random variate from $G(m, \beta)$. This mixture of gamma variates with integral shape parameters will approximate the desired gamma distribution. This technique will only work when $\alpha \ge 1$.

In the particular case when $\delta = 1/2$ gamma variables can be generated exactly by adding half the square of a standard normal variate to the variate generated in (3.6.10).

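Procedure G-2

Jöhnk [16] suggested a technique that exactly generates* variates from $G(\delta, \beta)$, where $0 < \delta < 1$.

*It is understood that when we say a method "exactly generates" random variables on a computer, the exactness is limited by the computer used and by the randomness of the underlying pseudorandom number generator.

Theorem 3.6.1 Let $W$ and $V$ be independent variates from beta distribution $Be(\delta, 1 - \delta)$ (see Section 3.6.3) and $\exp(1)$, respectively. Then $X = \beta V W$ is a variate with $G(\delta, \beta)$.

Proof Let $u = v$ and let $x = \beta v w$. Then $w = x/(\beta u)$ and $v = u$. The Jacobian of this transformation is

$$J = \begin{vmatrix} \dfrac{\partial v}{\partial u} & \dfrac{\partial v}{\partial x} \\[2mm] \dfrac{\partial w}{\partial u} & \dfrac{\partial w}{\partial x} \end{vmatrix} = \begin{vmatrix} 1 & 0 \\[1mm] -\dfrac{x}{\beta u^2} & \dfrac{1}{u\beta} \end{vmatrix} = \frac{1}{u\beta}. \tag{3.6.11}$$

The joint distribution of $(u, x)$ is therefore given by

$$f_{U,X}(u, x) = \begin{cases} \dfrac{1}{\Gamma(\delta)\Gamma(1 - \delta)}\,\dfrac{1}{\beta^{\delta}}\, x^{\delta-1} \left( u - \dfrac{x}{\beta} \right)^{-\delta} e^{-u}, & 0 \le x/\beta \le u < \infty \\ 0, & \text{otherwise.} \end{cases} \tag{3.6.12}$$

The marginal distribution for $X$ is

$$f_X(x) = \int_{x/\beta}^{\infty} f_{U,X}(u, x)\,du = \begin{cases} \dfrac{x^{\delta-1} e^{-x/\beta}}{\beta^{\delta}\,\Gamma(\delta)}, & x \ge 0 \\ 0, & x < 0, \end{cases}$$

which is $G(\delta, \beta)$.

Q.E.D.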


Algorithm G-2

1 Generate two variates $W$ and $V$ from $Be(\delta, 1 - \delta)$ and $\exp(1)$, respectively.

2 Compute $X = \beta V W$, which is from $G(\delta, \beta)$.

3 Deliver $X$.

To generate a variate from $G(\alpha, \beta)$ we generate an r.v. $Y$ from $Er(m, 1)$, then compute $X = \beta(Y + VW)$, which is from $G(\alpha, \beta)$. Here $\alpha = \delta + m$.
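The combination of an Erlang part and a fractional part can be sketched in Python as below. The function name `gamma_g2` is an assumption, and the beta variate $W$ is drawn with the standard library's `random.betavariate`, which is only a stand-in for the beta generators of Section 3.6.3. When $\alpha$ is an integer the fractional part is skipped, since $Be(0, 1)$ is not defined.

```python
import math
import random

def gamma_g2(alpha, beta, rng):
    """G(alpha, beta) variate as beta * (Y + V W): Erlang part plus fractional part."""
    m = int(math.floor(alpha))
    delta = alpha - m
    y = sum(-math.log(1.0 - rng.random()) for _ in range(m))   # Y from Er(m, 1)
    if delta > 0.0:
        w = rng.betavariate(delta, 1.0 - delta)                # W from Be(delta, 1 - delta)
        v = -math.log(1.0 - rng.random())                      # V from exp(1)
        y += v * w                                             # V W is from G(delta, 1)
    return beta * y

rng = random.Random(9)
g2_sample = [gamma_g2(2.5, 2.0, rng) for _ in range(20000)]
g2_mean = sum(g2_sample) / len(g2_sample)   # theory: alpha * beta = 5
```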


Recently, a number of procedures for sampling from $G(\alpha, \beta)$, based on the acceptance-rejection method, were suggested by Ahrens and Dieter [3], Cheng [9], Fishman [13], Tadikamalla [30, 31], and Wallace [35]. Let us consider some of them.


Procedure G-3

Wallace [35] suggested a procedure for generating from $G(\alpha, 1)$ with $\alpha \ge 1$ based on both the acceptance-rejection and probability switch methods. Let

$$f_X(x) = C h(x) g(x),$$

where $h(x)$ is a mixture of two Erlang distributions $Er(m, 1)$ and $Er(m + 1, 1)$ equal to

$$h(x) = P\,\frac{x^{m-1} e^{-x}}{(m - 1)!} + (1 - P)\,\frac{x^{m} e^{-x}}{m!}, \quad x \ge 0, \tag{3.6.13}$$

and

$$C = \frac{(m - 1)!\, m^{\delta}}{\Gamma(\alpha)} \tag{3.6.14}$$

$$g(x) = \left( \frac{x}{m} \right)^{\delta} \left[ 1 + \left( \frac{x}{m} - 1 \right) \delta \right]^{-1}. \tag{3.6.15}$$

It can be found from (3.4.10) that the optimal $P$ is equal to $1 - \delta$, where $\delta = \alpha - [\alpha]$. It follows from (3.6.14) that the mean number of trials $C$ is a monotone decreasing function of $m$ for a fixed $\delta$ and

$$\lim_{m \to \infty} \frac{(m - 1)!\, m^{\delta}}{\Gamma(m + \delta)} = 1,$$

that is, asymptotically the execution time does not depend on $\delta$ and achieves optimal efficiency $C = 1$. Algorithm G-3 describes Wallace's procedure.

Algorithm G-3

1 Compute $\delta = \alpha - m$, where $m = [\alpha]$.

2 Generate $U_1, \ldots, U_{m+1}$ from $\mathcal{U}(0, 1)$.

3 With probability $1 - \delta$ compute

$$V = -\ln \prod_{i=1}^{m} U_i.$$

4 With probability $\delta$ compute

$$V = -\ln \prod_{i=1}^{m+1} U_i.$$

5 Generate another uniform variate $U$ from $\mathcal{U}(0, 1)$.

6 If $U \le (V/m)^{\delta} / [1 + ((V/m) - 1)\delta]$, deliver $V$ as an r.v. from $G(\alpha, 1)$.

7 Go to step 2.
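Wallace's procedure can be sketched in Python as follows. The function name `gamma_wallace` is an assumption, and the probability switch is implemented here with one extra uniform, which is a choice made for this sketch. The acceptance test is $U \le (V/m)^{\delta}/[1 + (V/m - 1)\delta]$, and the sketch assumes $\alpha \ge 1$.

```python
import math
import random

def gamma_wallace(alpha, rng):
    """G(alpha, 1) variate, alpha >= 1, by an Erlang mixture with rejection (sketch)."""
    m = int(math.floor(alpha))
    delta = alpha - m
    while True:
        # Probability switch: Er(m, 1) with probability 1 - delta, else Er(m + 1, 1).
        k = m if rng.random() < 1.0 - delta else m + 1
        v = -math.log(math.prod(1.0 - rng.random() for _ in range(k)))
        u = rng.random()
        ratio = v / m
        if u <= ratio ** delta / (1.0 + (ratio - 1.0) * delta):
            return v

rng = random.Random(10)
wallace_sample = [gamma_wallace(3.7, rng) for _ in range(20000)]
wallace_mean = sum(wallace_sample) / len(wallace_sample)   # theory: alpha = 3.7
```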

The following three procedures are reproduced with little change from 
Ref. 12. 

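Procedure G-4

Fishman [13] describes another procedure for generating from $G(\alpha, 1)$, $\alpha \ge 1$:

$$g(x) = \frac{x^{\alpha-1} \exp[-x(1 - 1/\alpha)]}{\alpha^{\alpha-1} \exp(1 - \alpha)} \tag{3.6.16}$$

$$h(x) = \frac{1}{\alpha} e^{-x/\alpha} \tag{3.6.17}$$

$$C = \frac{\alpha^{\alpha} e^{1-\alpha}}{\Gamma(\alpha)}, \quad \alpha \ge 1,\ x \ge 0. \tag{3.6.18}$$

The probability of success on a trial is

$$\frac{1}{C} = \frac{\Gamma(\alpha)}{\alpha^{\alpha} e^{1-\alpha}}. \tag{3.6.19}$$

For large $\alpha$ the mean number of trials is

$$C \approx e \left( \frac{\alpha}{2\pi} \right)^{1/2}. \tag{3.6.20}$$

It is not difficult to see that the condition $U \le g(Y)$, where the r.v. $Y = \alpha V_1$ is from $\exp(\alpha)$, can be written as $V_2 \ge (\alpha - 1)(V_1 - \ln V_1 - 1)$, where $V_1$ and $V_2$ are independent r.v.'s from $\exp(1)$.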

Algorithm G-4

1 $A \leftarrow \alpha - 1$.

2 Generate $V_1$ and $V_2$ from $\exp(1)$.

3 If $V_2 < A(V_1 - \ln V_1 - 1)$, go to step 2.

4 Deliver $\alpha V_1$ as a variate from $G(\alpha, 1)$.
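A Python sketch of this procedure is given below. The function name `gamma_fishman` is an assumption. The candidate is taken as $Y = \alpha V_1$, which has the envelope density $h(x) = \alpha^{-1} e^{-x/\alpha}$, and it is accepted when $V_2 \ge (\alpha - 1)(V_1 - \ln V_1 - 1)$; the accepted $Y$ is returned.

```python
import math
import random

def gamma_fishman(alpha, rng):
    """G(alpha, 1) variate, alpha >= 1, by rejection from an exponential envelope."""
    a = alpha - 1.0
    while True:
        v1 = -math.log(1.0 - rng.random())   # exp(1)
        v2 = -math.log(1.0 - rng.random())   # exp(1)
        # v1 > 0 guards ln(v1); the second test is the acceptance condition.
        if v1 > 0.0 and v2 >= a * (v1 - math.log(v1) - 1.0):
            return alpha * v1

rng = random.Random(11)
fishman_sample = [gamma_fishman(2.5, rng) for _ in range(20000)]
fishman_mean = sum(fishman_sample) / len(fishman_sample)   # theory: alpha = 2.5
```

Since the mean number of trials grows like $\alpha^{1/2}$, this sketch is best suited to moderate $\alpha$.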

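Procedure G-5

This procedure is due to Cheng [9] and describes gamma generation $G(\alpha, 1)$ for $\alpha \ge 1$ with execution time asymptotically independent of $\alpha$.

In Cheng's procedure

$$h(x) = \begin{cases} \lambda \mu x^{\lambda-1} (\mu + x^{\lambda})^{-2}, & x \ge 0 \\ 0, & \text{otherwise} \end{cases} \tag{3.6.21}$$

$$C = \frac{4\alpha^{\alpha}}{\Gamma(\alpha)\, e^{\alpha} \lambda} \tag{3.6.22}$$

$$g(x) = x^{\alpha-\lambda} (\mu + x^{\lambda})^2\, \frac{e^{\alpha - x}}{4\alpha^{\alpha+\lambda}}, \tag{3.6.23}$$

where

$$\mu = \alpha^{\lambda}, \quad \lambda = (2\alpha - 1)^{1/2}.$$

The mean number of trials $C$ is a monotonically decreasing function of $\alpha$ such that, for $\alpha = 1$, $C = 1.47$, and for $\alpha = 2$, $C = 1.25$; asymptotically

$$\lim_{\alpha \to \infty} C = \frac{2}{\sqrt{\pi}} \approx 1.13. \tag{3.6.24}$$

Let $b = \alpha - \ln 4$ and $d = \alpha + \lambda$. Then Cheng's algorithm can be written as follows.

Algorithm G-5

1 Sample $U_1$ and $U_2$ from $\mathcal{U}(0, 1)$.

2 $V \leftarrow \lambda^{-1} \ln[U_1/(1 - U_1)]$.

3 $X \leftarrow \alpha e^{V}$.

4 If $b + dV - X \ge \ln(U_1^2 U_2)$, deliver $X$.

5 Go to step 1.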
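Cheng's procedure can be sketched in Python as below. The function name `gamma_cheng` is an assumption. The candidate is obtained by inverting the envelope c.d.f. $H(x) = x^{\lambda}/(\mu + x^{\lambda})$, which gives $X = \alpha e^{V}$ with $V = \lambda^{-1}\ln[U_1/(1-U_1)]$, and the acceptance test $U_2 \le g(X)$ is written in logarithmic form with $b = \alpha - \ln 4$ and $d = \alpha + \lambda$.

```python
import math
import random

def gamma_cheng(alpha, rng):
    """G(alpha, 1) variate, alpha >= 1, by rejection from a log-logistic envelope."""
    lam = math.sqrt(2.0 * alpha - 1.0)
    b = alpha - math.log(4.0)
    d = alpha + lam
    while True:
        u1 = rng.random()
        u2 = 1.0 - rng.random()            # in (0, 1], so the logarithm is defined
        if u1 <= 0.0:
            continue                       # avoid ln(0) in the next line
        v = math.log(u1 / (1.0 - u1)) / lam
        x = alpha * math.exp(v)
        if b + d * v - x >= math.log(u1 * u1 * u2):
            return x

rng = random.Random(12)
cheng_sample = [gamma_cheng(2.5, rng) for _ in range(20000)]
cheng_mean = sum(cheng_sample) / len(cheng_sample)   # theory: alpha = 2.5
```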

Procedure G-6

Ahrens and Dieter [3] suggested an alternative procedure for generating from $G(\alpha, 1)$ with $\alpha > 1$ and execution time independent of $\alpha$ asymptotically and equal to $\lim_{\alpha \to \infty} C = \sqrt{\pi}$. Their procedure makes use of the truncated Cauchy distribution.

Let

$$h(x) = \frac{h_X(x)}{1 - H_X(0)}, \quad x \ge 0, \tag{3.6.25}$$

and

$$g(x) = \left[ 1 + \left( \frac{x - \gamma}{\beta} \right)^2 \right] \left( \frac{x}{\gamma} \right)^{\gamma} e^{-x + \gamma}, \quad x \ge 0, \tag{3.6.26}$$

where

$$C = \frac{\pi \beta \gamma^{\gamma} [1 - H_X(0)]}{\Gamma(\alpha)\, e^{\gamma}} \tag{3.6.27}$$

and

$$h_X(x) = \frac{\beta}{\pi[\beta^2 + (x - \gamma)^2]}, \tag{3.6.28}$$

$$H_X(x) = \frac{1}{2} + \pi^{-1} \tan^{-1}\left( \frac{x - \gamma}{\beta} \right), \quad -\infty < x < \infty, \tag{3.6.29}$$

are the p.d.f. and c.d.f. of the Cauchy distribution, respectively, with parameters $\gamma = \alpha - 1$ and $\beta = (2\alpha - 1)^{1/2}$.

It follows from (3.6.25) and (3.6.28) that $h(x)$ is the truncated Cauchy distribution with parameters $\gamma$ and $\beta$.

To apply the acceptance condition $U \le g(Y)$, we have to generate an r.v. $Y$ from the truncated Cauchy distribution $h(y)$. The c.d.f. of $Y$ is

$$H_Y(y) = \frac{H_X(y) - H_X(0)}{1 - H_X(0)}, \tag{3.6.30}$$

where $H_X(y)$ is given in (3.6.29).

Substituting (3.6.29) in (3.6.30) and using the inverse transform formula $Y = H_Y^{-1}(U)$, we obtain

$$Y = \beta \tan \pi \left\{ U[1 - H_X(0)] + H_X(0) - \tfrac{1}{2} \right\} + \gamma, \tag{3.6.31}$$

where by (3.6.29)

$$H_X(0) = \frac{1}{2} - \pi^{-1} \tan^{-1}(\gamma/\beta). \tag{3.6.32}$$

It is readily seen that the condition $U \le g(Y)$ is equivalent to

$$-V = \ln U \le \ln g(Y) = \ln\left[ 1 + \frac{(Y - \gamma)^2}{\beta^2} \right] + \gamma \ln\frac{Y}{\gamma} - Y + \gamma, \tag{3.6.33}$$

where $V$ is from $\exp(1)$. In Algorithm G-6 the candidate is generated from the untruncated Cauchy distribution, $Y = \gamma + \beta \tan \pi(U - \tfrac{1}{2})$, which can be found from $U = H_X(y)$.


Algorithm G-6 

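1 $\gamma \leftarrow \alpha - 1$, $\beta \leftarrow (2\alpha - 1)^{1/2}$.

2 Generate $U$ from $\mathcal{U}(0, 1)$.

3 $Y \leftarrow \gamma + \beta \tan \pi (U - \tfrac{1}{2})$.

4 Generate $V$ from exp(1).

5 If $Y > 0$ and $-V \le \ln[1 + (Y - \gamma)^{2}/\beta^{2}] + \gamma \ln(Y/\gamma) - Y + \gamma$, deliver $Y$.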

6 Go to step 2. 
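
Algorithm G-6 might be coded along the following lines. The sketch assumes that nonpositive Cauchy variates are simply discarded, which is equivalent to sampling from the truncated Cauchy envelope; the name `gamma_ahrens_dieter` is an illustrative choice, not from the source.

```python
import math
import random


def gamma_ahrens_dieter(alpha, rng=random):
    """Sketch of Algorithm G-6: sample G(alpha, 1), alpha > 1, using a
    Cauchy envelope with location gamma and scale beta (hypothetical helper)."""
    gam = alpha - 1.0
    beta = math.sqrt(2.0 * alpha - 1.0)
    while True:
        u = rng.random()
        y = gam + beta * math.tan(math.pi * (u - 0.5))  # Cauchy variate
        if y <= 0.0:
            continue                                     # outside the truncated support
        v = rng.expovariate(1.0)                         # V from exp(1)
        log_g = (math.log1p((y - gam) ** 2 / beta ** 2)
                 + gam * math.log(y / gam) - y + gam)
        if -v <= log_g:
            return y
```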


The following two procedures for generating from G(a, 1) are due to 
Tadikamalla [30, 31]. 


Procedure G-7 

In this procedure [30] $h(x)$ is from $Er(m, \beta)$, that is,

$$h(x) = \frac{x^{m-1} e^{-x/\beta}}{\beta^{m} (m - 1)!}, \qquad \beta > 0, \; m > 0, \; x \ge 0. \tag{3.6.34}$$

Then it is readily shown that

$$g(x) = \frac{x^{\delta} \exp[-x(1 - 1/\beta)]}{[\delta \beta / (\beta - 1)]^{\delta} e^{-\delta}}, \qquad x \ge 0 \tag{3.6.35}$$

and

$$C = \frac{\beta^{m} (m - 1)!}{\Gamma(\alpha)} \left[ \frac{\delta \beta}{\beta - 1} \right]^{\delta} e^{-\delta}, \tag{3.6.36}$$

where $\delta = \alpha - m$ and $m = [\alpha]$.

The value of $\beta$ that maximizes the efficiency can be found from (3.4.10) and is equal to $\alpha / m$.

Tadikamalla showed by simulation that his procedure is faster than Fishman's Procedure G-4 for $3 < \alpha < 19$ and is comparable for other values of $\alpha$. For $1 < \alpha < 2$ both methods coincide. This is not surprising, since then $m = 1$ and $Er(1, \beta)$ is the exponential distribution. The reason for the great efficiency of this procedure is that the Erlang distribution $Er(m, \beta)$, with $m = [\alpha]$, approximates the gamma distribution $G(\alpha, \beta)$ better than the exponential distribution exp(1/α) (see Procedure G-4) does.

In addition, Tadikamalla's procedure is better than Ahrens and Dieter's Procedure G-6 for $\alpha < 8$.


Algorithm G-7 

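1 Compute $\delta \leftarrow \alpha - m$ and $\beta \leftarrow \alpha / m$, where $m = [\alpha]$.

2 Generate $m$ independent random variates $U_1, \ldots, U_m$ from $\mathcal{U}(0, 1)$.

3 Compute $Y = -\beta \ln \prod_{i=1}^{m} U_i$.

4 Generate another uniform variate $U$ from $\mathcal{U}(0, 1)$.

5 If

$$U \le \frac{Y^{\delta} \exp[-Y(1 - 1/\beta)]}{[\delta \beta / (\beta - 1)]^{\delta} e^{-\delta}},$$

deliver $Y$.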

6 Go to step 2, 
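
As an illustration, Algorithm G-7 can be sketched in Python as below. The sketch assumes $\beta = \alpha/m$ (the efficiency-maximizing choice stated above) and treats integer $\alpha$ separately, because then $\delta = 0$ and the Erlang variate is already the required gamma variate; the function name is hypothetical.

```python
import math
import random


def gamma_tadikamalla_erlang(alpha, rng=random):
    """Sketch of Algorithm G-7: sample G(alpha, 1), alpha >= 1, by rejection
    from an Erlang Er(m, beta) envelope with m = [alpha], beta = alpha / m."""
    m = int(math.floor(alpha))
    delta = alpha - m
    beta = alpha / m
    while True:
        prod = 1.0
        for _ in range(m):
            prod *= 1.0 - rng.random()        # uniform on (0, 1]
        y = -beta * math.log(prod)            # Erlang variate Er(m, beta)
        if delta == 0.0:
            return y                          # integer alpha: envelope is exact
        if y <= 0.0:
            continue
        # ln g(y); note that delta * beta / (beta - 1) simplifies to alpha.
        log_g = (delta * math.log(y) - y * (1.0 - 1.0 / beta)
                 - delta * math.log(delta * beta / (beta - 1.0)) + delta)
        if math.log(1.0 - rng.random()) <= log_g:
            return y
```

Each trial consumes $m + 1$ uniforms, which is why the cost per trial grows with $\alpha$ even though the number of trials falls.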

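Procedure G-8

In this procedure [31] $h(x)$ is from the Laplace (double exponential) distribution with location parameter $\alpha - 1$ and scale parameter $\theta$, that is,

$$h(x) = \frac{\exp\{ -|x - (\alpha - 1)| / \theta \}}{2\theta}, \qquad -\infty < x < \infty, \; \theta > 0. \tag{3.6.37}$$

Then it is readily shown that

$$g(x) = \left[ \frac{(\theta - 1) x}{\theta (\alpha - 1)} \right]^{\alpha - 1} \exp \left\{ -x + \frac{|x - (\alpha - 1)| + (\alpha - 1)(\theta + 1)}{\theta} \right\} \tag{3.6.38}$$

and

$$C = \frac{2 \theta^{\alpha}}{\Gamma(\alpha)} \left[ \frac{\alpha - 1}{\theta - 1} \right]^{\alpha - 1} \exp \left\{ - \frac{(\alpha - 1)(\theta + 1)}{\theta} \right\}. \tag{3.6.39}$$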


Algorithm G-8 

1 Generate a random variate $Y$ from the Laplace distribution with location parameter $\alpha - 1$ and scale parameter $\theta$.

2 If $Y < 0$, go to step 1.

3 Generate a uniform random variate $U$ from $\mathcal{U}(0, 1)$.

4 If $U \le g(Y)$ (see (3.6.38)), deliver $Y$.

5 Go to step 1.

Table 3.6.1 The Relative Efficiencies (1/C), and the Average Number of Random Numbers Required (N) for Certain Algorithms

             Fishman            Tadikamalla 1         Tadikamalla 2
   α       1/C       N         1/C        N          1/C          N
   1.5    0.7953    2.5       0.7953     2.5      (illegible)    2.3
   2.5    0.6029    3.3       0.8871     3.4      (illegible)    2.5
   3.5    0.5047    4.0       0.9222     4.3       0.7565        2.6
   5.5    0.3992    5.0     (illegible)  6.3       0.7304        2.7
   8.5    0.3194    6.3       0.9695     9.3       0.7174        2.8
  10.5    0.2868    7.0       0.9755    11.3       0.7144        2.8
  15.5    0.2355    8.5       0.9836    16.3       0.7132        2.8
  20.5    0.2045    9.8       0.9876    21.3       0.7149        2.8
  30.5    0.1674   11.9       0.9917    31.3       0.7195        2.8
 100.5    0.0920   21.7       1.0000   101.0       0.7355        2.7


Tadikamalla [31] compared the relative efficiency and CPU timing of his 
procedures with Fishman's [13] and Ahrens and Dieter’s procedures [3]. 

Table 3.6.1 gives the relative efficiencies and the number of uniform random numbers required for these procedures for some selected values of α. The efficiencies of Ahrens and Dieter's method are not given in Table 3.6.1 because these have to be calculated numerically and the details are not available in Ref. 3. For increasing values of α the efficiency of Fishman's algorithm decreases and the efficiencies of Tadikamalla's first algorithm (G-7) and of Ahrens and Dieter's algorithm increase. The efficiency of Tadikamalla's second algorithm (G-8) decreases as α increases up to a certain value and then it increases again. Also note that the number of uniforms required for Tadikamalla's second algorithm (G-8) remains fairly constant.

Table 3.6.2 gives the CPU timings for these four methods on an IBM 
370/165 computer, for selected values of a. These timings are based on 
generating 10,000 variates and using the subroutine TIMER available on 
the IBM computer. 

The following observations can be made about the methods compared 
above. 

1 Fishman's procedure is the simplest of all the procedures and the CPU time per trial is constant for any α. As α increases, the number of trials required for one gamma variate increases (efficiency decreases), and thus the CPU time per variate increases with α.

Table 3.6.2 Average CPU Times (in Microseconds) to Generate One Gamma Variate on the IBM 370/165 Computer

    α      Fishman   Tadikamalla 1   Ahrens and Dieter   Tadikamalla 2
    1.5      127          137              N/A                138
    2.5      175          176              N/A                152
    3.5      213          184              225                157
    5.5      260          225              218                162
    8.5      334          307              210                166
   10.5      380          354              209                167
   15.5      473          470              203                167
   20.5      559          596              194                166
   30.5      693          850              190                165
   50.5       —            —               181                164
  100.5       —            —               171                162


2 Tadikamalla's first procedure (G-7) is also simple, and in this case the number of trials per gamma variate decreases as α increases. However, the CPU time per trial increases with α (more uniforms are required per trial). The average CPU time per variate for this procedure increases with α. Tadikamalla's procedure is faster than Fishman's procedure for 3 < α < 19 and the same as Fishman's procedure for 1 < α < 2.

3 Tadikamalla's second procedure (G-8) is faster than Fishman's and Tadikamalla's first procedure (G-7) for α > 2 and is faster than Ahrens and Dieter's for all α. The average CPU time required per variate for Tadikamalla's second procedure remains fairly constant for medium and large values of α.

3.6.3 Beta Distribution

An r.v. $X$ has a beta distribution if the p.d.f. is

$$f_X(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}, \qquad \alpha > 0, \; \beta > 0, \; 0 \le x \le 1 \tag{3.6.40}$$

and is denoted by $Be(\alpha, \beta)$. There are several ways of generating from $Be(\alpha, \beta)$.

Procedure Be-1

This procedure is based on the result from Section 3.5.2 (example 2) that says: if $Y_1$ and $Y_2$ are independent r.v.'s from $G(\alpha, 1)$ and $G(\beta, 1)$, respectively, then

$$X = \frac{Y_1}{Y_1 + Y_2} \tag{3.6.41}$$

is from $Be(\alpha, \beta)$.

The corresponding algorithm is as follows.

Algorithm Be-1

1 Generate $Y_1$ from $G(\alpha, 1)$.

2 Generate $Y_2$ from $G(\beta, 1)$.

3 $X \leftarrow Y_1 / (Y_1 + Y_2)$.

4 Deliver $X$.
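
Algorithm Be-1 translates almost directly into code. The sketch below uses the Python standard library gamma generator `random.gammavariate` in place of one of the G-procedures of Section 3.6.2; that substitution, and the name `beta_from_gammas`, are choices made for this example.

```python
import random


def beta_from_gammas(alpha, beta, rng=random):
    """Sketch of Algorithm Be-1: Be(alpha, beta) as a ratio of gamma variates."""
    y1 = rng.gammavariate(alpha, 1.0)   # Y1 from G(alpha, 1)
    y2 = rng.gammavariate(beta, 1.0)    # Y2 from G(beta, 1)
    return y1 / (y1 + y2)
```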


Procedure Be-2

Another approach when α and β are integers is based on the theory of order statistics. Let $U_1, \ldots, U_{\alpha + \beta - 1}$ be random variates from $\mathcal{U}(0, 1)$. Then the αth order statistic $U_{(\alpha)}$ is from $Be(\alpha, \beta)$. The algorithm is extremely simple.

Algorithm Be-2

1 Generate $(\alpha + \beta - 1)$ uniform random variates $U_1, \ldots, U_{\alpha + \beta - 1}$ from $\mathcal{U}(0, 1)$.

2 Find $U_{(\alpha)}$, which is from $Be(\alpha, \beta)$.

It can be shown that the total number of comparisons needed to find $U_{(\alpha)}$ is equal to $(\alpha / 2)(\alpha + 2\beta - 1)$, that is, this procedure is not efficient for large α and β.
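
A minimal sketch of Algorithm Be-2 follows. For brevity it sorts the whole sample rather than performing the partial selection whose comparison count is quoted above; the function name is hypothetical.

```python
import random


def beta_order_statistic(a, b, rng=random):
    """Sketch of Algorithm Be-2 for integer a, b >= 1: the a-th smallest of
    a + b - 1 uniforms is a Be(a, b) variate."""
    us = sorted(rng.random() for _ in range(a + b - 1))
    return us[a - 1]   # a-th order statistic (lists are 0-indexed)
```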

Many procedures for sampling from $Be(\alpha, \beta)$ with nonintegral α and β have been proposed recently (see Ahrens and Dieter [3], Cheng [8], Johnk [16], and Michailov [22]). We consider only a few of them.

Procedure Be-3 

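The simplest procedure for generating from $Be(\alpha, \beta)$ with arbitrary nonintegral α and β uses the mode

$$M = f_X(x^{*}) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} \left( \frac{\alpha - 1}{\alpha + \beta - 2} \right)^{\alpha - 1} \left( \frac{\beta - 1}{\alpha + \beta - 2} \right)^{\beta - 1}, \tag{3.6.42}$$

which corresponds to $x^{*} = (\alpha - 1)/(\alpha + \beta - 2)$.

The following algorithm, Be-3, is based on the acceptance-rejection Algorithm AR-2.

Algorithm Be-3

1 $M \leftarrow \dfrac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} \left( \dfrac{\alpha - 1}{\alpha + \beta - 2} \right)^{\alpha - 1} \left( \dfrac{\beta - 1}{\alpha + \beta - 2} \right)^{\beta - 1}$.

2 Generate $U_1$ and $U_2$ from $\mathcal{U}(0, 1)$.

3 If $M U_2 \le [\Gamma(\alpha + \beta) / \Gamma(\alpha) \Gamma(\beta)] \, U_1^{\alpha - 1} (1 - U_1)^{\beta - 1}$, deliver $U_1$ as a variate from $Be(\alpha, \beta)$.

4 Go to step 2.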

Procedure Be-4 

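This procedure is due to Johnk [16] and is based on the following theorem.

Theorem 3.6.2 Let $U_1$ and $U_2$ be two uniform variates from $\mathcal{U}(0, 1)$ and let $Y_1 = U_1^{1/\alpha}$ and $Y_2 = U_2^{1/\beta}$. If $Y_1 + Y_2 \le 1$, then

$$X = \frac{Y_1}{Y_1 + Y_2} \tag{3.6.43}$$

is from $Be(\alpha, \beta)$.

Proof It is obvious that

$$f_{Y_1}(y_1) = \alpha y_1^{\alpha - 1}, \qquad 0 \le y_1 \le 1 \tag{3.6.44}$$

$$f_{Y_2}(y_2) = \beta y_2^{\beta - 1}, \qquad 0 \le y_2 \le 1 \tag{3.6.45}$$

and

$$f_{Y_1, Y_2}(y_1, y_2) = \alpha \beta y_1^{\alpha - 1} y_2^{\beta - 1}. \tag{3.6.46}$$

Let $X = Y_1 / (Y_1 + Y_2)$ and $W = Y_1 + Y_2$. The Jacobian of the transformation is

$$J = \begin{vmatrix} \dfrac{\partial y_1}{\partial x} & \dfrac{\partial y_1}{\partial w} \\[2mm] \dfrac{\partial y_2}{\partial x} & \dfrac{\partial y_2}{\partial w} \end{vmatrix} = \begin{vmatrix} w & x \\ -w & 1 - x \end{vmatrix} = w \tag{3.6.47}$$

and

$$f_{X, W}(x, w) = \alpha \beta x^{\alpha - 1} (1 - x)^{\beta - 1} w^{\alpha + \beta - 1}, \qquad 0 \le x \le 1, \; 0 \le w \le 2. \tag{3.6.48}$$

By Bayes' formula

$$f_X(x \mid 0 \le W \le 1) = \frac{f_{X, W}(x, 0 \le W \le 1)}{\Pr(0 \le W \le 1)}, \tag{3.6.49}$$

where

$$f_{X, W}(x, 0 \le W \le 1) = \int_0^1 f_{X, W}(x, w) \, dw = \frac{\alpha \beta}{\alpha + \beta} x^{\alpha - 1} (1 - x)^{\beta - 1}, \qquad 0 \le x \le 1 \tag{3.6.50}$$

$$\Pr(0 \le W \le 1) = \int_0^1 f_{X, W}(x, 0 \le W \le 1) \, dx = \frac{\alpha \beta}{\alpha + \beta} \, \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha + \beta)}. \tag{3.6.51}$$

Substituting (3.6.51) and (3.6.50) into (3.6.49), we obtain

$$f_X(x \mid 0 \le W \le 1) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}, \qquad 0 \le x \le 1.$$

Q.E.D.

The efficiency of the method is equal to

$$\frac{1}{C} = P(Y_1 + Y_2 \le 1) = \Pr(0 \le W \le 1) = \frac{\alpha \beta}{\alpha + \beta} \, \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha + \beta)}. \tag{3.6.52}$$

For integer α and β

$$C = \frac{(\alpha + \beta)!}{\alpha! \, \beta!}. \tag{3.6.53}$$

Table 3.6.3 represents the mean number of trials $C$ as a function of α and β. Asymptotically,

$$\lim_{\substack{\alpha \to \infty \\ \beta > 0}} C = \lim_{\substack{\beta \to \infty \\ \alpha > 0}} C = \lim_{\substack{\alpha \to \infty \\ \beta \to \infty}} C = \infty.$$

Thus for large α or β Johnk's procedure is not efficient.

Table 3.6.3 The Mean Number of Trials as a Function of α and β

              β
   α      1      3      5
   1      2      4      6
   3      4     20     56
   5      6     56    252

Algorithm Be-4

1 $j \leftarrow 1$.

2 Generate $U_j$ and $U_{j+1}$ from $\mathcal{U}(0, 1)$.

3 $Y_1 \leftarrow U_j^{1/\alpha}$.

4 $Y_2 \leftarrow U_{j+1}^{1/\beta}$.

5 If $Y_1 + Y_2 > 1$, go to step 2.

6 $j \leftarrow j + 2$.

7 Deliver $X = Y_1 / (Y_1 + Y_2)$.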
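
Johnk's Algorithm Be-4 can be sketched in a few lines of Python. The index bookkeeping on $j$ is dropped because fresh uniforms are drawn on every pass; the function name is hypothetical.

```python
import random


def beta_johnk(alpha, beta, rng=random):
    """Sketch of Algorithm Be-4 (Johnk): Be(alpha, beta) for alpha, beta > 0."""
    while True:
        y1 = rng.random() ** (1.0 / alpha)
        y2 = rng.random() ** (1.0 / beta)
        s = y1 + y2
        if 0.0 < s <= 1.0:          # accept only when Y1 + Y2 <= 1
            return y1 / s
```

In line with (3.6.52), the sketch is attractive mainly for small α and β, where the acceptance probability is high.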

Procedure Be-5 

This procedure is based on the results of examples 6 and 7 from Section 3.4.3. As follows from (3.4.20) and (3.4.24), the efficiencies of the acceptance-rejection method AR-3 are, respectively,

$$\frac{1}{C} = \frac{\beta \, \Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha + \beta)} \tag{3.6.54}$$

$$\frac{1}{C} = \frac{\alpha \, \Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha + \beta)} \tag{3.6.55}$$

in examples 6 and 7.

For integer α and β we have, respectively,

$$\frac{1}{C} = \frac{(\alpha - 1)! \, \beta!}{(\alpha + \beta - 1)!} \tag{3.6.56}$$

$$\frac{1}{C} = \frac{\alpha! \, (\beta - 1)!}{(\alpha + \beta - 1)!}. \tag{3.6.57}$$

In both cases (3.6.56) and (3.6.57) the efficiencies are a little higher than in Johnk's procedure Be-4 (see (3.6.53)). It is interesting to note that for $\beta > \alpha$ it is more efficient to represent $f_X(x)$ in the form of (3.4.18) through (3.4.20) and for $\alpha > \beta$ it is more efficient to represent $f_X(x)$ in the form of (3.4.22) through (3.4.24).

Procedure Be-6 


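In this procedure $h(x)$ is $Be(m, n)$, that is,

$$h(x) = \frac{x^{m - 1} (1 - x)^{n - 1}}{B(m, n)}, \qquad 0 \le x \le 1 \tag{3.6.58}$$

where $m = [\alpha]$ and $n = [\beta]$. Then

$$C g(x) = \frac{f_X(x)}{h(x)} = \frac{B(m, n)}{B(\alpha, \beta)} x^{\delta_1} (1 - x)^{\delta_2}, \qquad 0 \le x \le 1$$

where $\delta_1 = \alpha - m$, $\delta_2 = \beta - n$, and $B(r, s) = \Gamma(r) \Gamma(s) / \Gamma(r + s)$. It is quite easy to prove that the function $y = x^{\delta_1} (1 - x)^{\delta_2}$ is concave on [0, 1] and achieves its unique maximum

$$y^{*} = \frac{\delta_1^{\delta_1} \delta_2^{\delta_2}}{(\delta_1 + \delta_2)^{\delta_1 + \delta_2}}$$

at the point

$$x^{*} = \frac{\delta_1}{\delta_1 + \delta_2}.$$

Now we set

$$g(x) = \frac{y}{y^{*}} = x^{\delta_1} (1 - x)^{\delta_2} \, \frac{(\delta_1 + \delta_2)^{\delta_1 + \delta_2}}{\delta_1^{\delta_1} \delta_2^{\delta_2}} \tag{3.6.59}$$

and

$$C = \frac{B(m, n)}{B(\alpha, \beta)} \, \frac{\delta_1^{\delta_1} \delta_2^{\delta_2}}{(\delta_1 + \delta_2)^{\delta_1 + \delta_2}}.$$

The efficiency of the procedure is

$$\frac{1}{C} = \frac{B(\alpha, \beta)}{B(m, n)} \, \frac{(\delta_1 + \delta_2)^{\delta_1 + \delta_2}}{\delta_1^{\delta_1} \delta_2^{\delta_2}}. \tag{3.6.60}$$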

It is easy to see that

$$\frac{1}{C} \ge \frac{(\delta_1 + \delta_2)^{\delta_1 + \delta_2}}{(m + n + 1)(m + n) \, \delta_1^{\delta_1} \delta_2^{\delta_2}}. \tag{3.6.61}$$

Comparing (3.6.61) with (3.6.56) and (3.6.57), we can also readily prove that Procedure Be-6 is more efficient than Procedure Be-5 for $\alpha > 2$, $\beta > 2$.

Algorithm Be-6 

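1 Generate $U$ from $\mathcal{U}(0, 1)$.

2 Generate $Y$ from $Be(m, n)$.

3 If

$$U \le Y^{\delta_1} (1 - Y)^{\delta_2} \, \frac{(\delta_1 + \delta_2)^{\delta_1 + \delta_2}}{\delta_1^{\delta_1} \delta_2^{\delta_2}},$$

deliver $Y$.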

4 Go to step 1. 


Remark If $\delta_1 = 0$, then $g(x) = (1 - x)^{\delta_2}$, $y^{*} = 1$, and $C = B(m, n)/B(m, \beta)$. If $\delta_2 = 0$, then $g(x) = x^{\delta_1}$, $y^{*} = 1$, and $C = B(m, n)/B(\alpha, n)$. If $\delta_1 = \delta_2 = 0$, then $C = 1$.

3.6.4 Normal Distribution

A random variable $X$ has a normal distribution if the p.d.f. is

$$f_X(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp \left[ - \frac{(x - \mu)^{2}}{2 \sigma^{2}} \right], \qquad -\infty < x < \infty, \tag{3.6.62}$$

and is denoted $N(\mu, \sigma^{2})$. Here $\mu$ is the mean and $\sigma^{2}$ is the variance.

Since $X = \mu + \sigma Z$, where $Z$ is the standard normal variable denoted by $N(0, 1)$, we consider only generation from $N(0, 1)$. As we mentioned in Section 3.2, the inverse transform method cannot be applied to the normal distribution and some alternative procedures have to be employed. We consider some of them. More about generation from the normal distribution can be found in Fishman [12].

Procedure N-1

This approach is due to Box and Muller [6]. Let us prove that, if $U_1$ and $U_2$ are independent random variates from $\mathcal{U}(0, 1)$, then the variates

$$Z_1 = (-2 \ln U_1)^{1/2} \cos 2\pi U_2, \qquad Z_2 = (-2 \ln U_1)^{1/2} \sin 2\pi U_2 \tag{3.6.63}$$

are independent standard normal deviates. To see this let us rewrite the system (3.6.63) as

$$Z_1 = (2V)^{1/2} \cos 2\pi U, \qquad Z_2 = (2V)^{1/2} \sin 2\pi U, \tag{3.6.64}$$

where $V$ is from exp(1) and $U_2 = U$. It follows from (3.6.64) that

$$Z_1^{2} + Z_2^{2} = 2V \qquad \text{and} \qquad \frac{Z_2}{Z_1} = \tan 2\pi U.$$

The Jacobian of the transformation is

$$J = \begin{vmatrix} \dfrac{\partial u}{\partial z_1} & \dfrac{\partial u}{\partial z_2} \\[2mm] \dfrac{\partial v}{\partial z_1} & \dfrac{\partial v}{\partial z_2} \end{vmatrix} = \begin{vmatrix} - \dfrac{z_2 \cos^{2} 2\pi u}{2\pi z_1^{2}} & \dfrac{\cos^{2} 2\pi u}{2\pi z_1} \\[2mm] z_1 & z_2 \end{vmatrix} = - \frac{1}{2\pi}, \qquad |J| = \frac{1}{2\pi},$$

and

$$f_{Z_1, Z_2}(z_1, z_2) = f_{U, V}(u, v) \, |J| = \frac{1}{2\pi} \exp \left( - \frac{z_1^{2} + z_2^{2}}{2} \right). \tag{3.6.65}$$

The last formula represents the joint p.d.f. of two independent standard normal deviates.


Algorithm N-1

1 Generate two independent random variates $U_1$ and $U_2$ from $\mathcal{U}(0, 1)$.

2 Compute $Z_1$ and $Z_2$ simultaneously by substituting $U_1$ and $U_2$ in the system of equations (3.6.63).
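
The Box and Muller transformation (3.6.63) is short enough to show in full. The following Python sketch returns both deviates of a pair; the function name is hypothetical, and $1 - U$ is used in place of $U_1$ only to keep the logarithm finite.

```python
import math
import random


def box_muller(rng=random):
    """Sketch of Algorithm N-1: two independent N(0, 1) deviates per call."""
    u1 = 1.0 - rng.random()           # uniform on (0, 1], so log(u1) is finite
    u2 = rng.random()
    r = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return r * math.cos(angle), r * math.sin(angle)
```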

Procedure N-2

This procedure is based on the acceptance-rejection method. Let the r.v. $X$ be distributed

$$f_X(x) = \sqrt{\frac{2}{\pi}} \, e^{-x^{2}/2}, \qquad x \ge 0. \tag{3.6.66}$$

Since the standard normal distribution is symmetrical about zero, we can assign a random sign to the r.v. generated from (3.6.66) and obtain an r.v. from $N(0, 1)$.

To generate an r.v. from (3.6.66) write $f_X(x)$ as

$$f_X(x) = C h(x) g(x)$$

where

$$h(x) = e^{-x} \tag{3.6.67}$$

$$C = \sqrt{\frac{2e}{\pi}} \tag{3.6.68}$$

$$g(x) = \exp \left[ - \frac{(x - 1)^{2}}{2} \right]. \tag{3.6.69}$$

The efficiency of the method is equal to $\sqrt{\pi / 2e} \approx 0.76$.

The acceptance condition $U \le g(Y)$ is

$$U \le \exp \left[ - \frac{(Y - 1)^{2}}{2} \right], \tag{3.6.70}$$

which is equivalent to

$$- \ln U \ge \frac{(Y - 1)^{2}}{2}, \tag{3.6.71}$$

where $Y$ is from exp(1).

Since $-\ln U$ is also from exp(1), the last inequality can be written

$$V_2 \ge \frac{(V_1 - 1)^{2}}{2}, \tag{3.6.72}$$

where both $V_2 = -\ln U$ and $V_1 = Y$ are from exp(1).

Algorithm N-2 

1 Generate $V_1$ and $V_2$ from exp(1).

2 If $V_2 < (V_1 - 1)^{2}/2$, go to step 1.

3 Generate $U$ from $\mathcal{U}(0, 1)$.

4 If $U \ge 0.5$, deliver $Z = -V_1$.

5 Deliver $Z = V_1$.
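
Algorithm N-2 can be written as the following Python sketch, which draws the two exponential variates with the standard library and attaches a random sign at the end; the function name is hypothetical.

```python
import random


def normal_exp_rejection(rng=random):
    """Sketch of Algorithm N-2: N(0, 1) by rejection from an exp(1) envelope."""
    while True:
        v1 = rng.expovariate(1.0)
        v2 = rng.expovariate(1.0)
        if v2 >= (v1 - 1.0) ** 2 / 2.0:      # acceptance condition (3.6.72)
            return -v1 if rng.random() >= 0.5 else v1
```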


Remark In order to obtain Algorithm N-2 we can represent $f_X(x)$ as

$$f_X(x) = C \, h_{Y_1}(x) \, [1 - H_{Y_2}(T(x))],$$

where

$$h_{Y_1}(x) = h(x) = e^{-x}, \qquad H_{Y_2}(T(x)) = 1 - e^{-T(x)}, \qquad T(x) = \tfrac{1}{2}(x - 1)^{2},$$

and then apply Algorithm AR-3'.

Procedure N-3 

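In this procedure we make use of the logistic distribution [32]

$$h(x, \theta) = \frac{e^{-x/\theta}}{\theta [1 + e^{-x/\theta}]^{2}}, \qquad -\infty < x < \infty. \tag{3.6.73}$$

It is shown numerically in Ref. 32 that $\theta^{*} = 0.626657$,

$$\frac{1}{C} = 0.9196 \tag{3.6.74}$$

and

$$g(x) = \frac{0.25}{C} \, [1 + \exp(-1.5957 x)]^{2} \exp \left( - \frac{x^{2}}{2} + 1.5957 x \right). \tag{3.6.75}$$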

Algorithm N-3 is as follows.

Algorithm N-3

1 Generate $U_1$ and $U_2$ from $\mathcal{U}(0, 1)$.

2 $Y \leftarrow -0.626657 \ln(1/U_1 - 1)$.

3 If $U_2 \le g(Y)$, deliver $Y$.

4 Go to step 1.

Procedure N-4

This procedure is based on the relationship between the normal distribution, the chi-squared distribution, and a vector uniformly distributed on the $n$-dimensional unit sphere.

Let $Z_1, \ldots, Z_n$ be i.i.d. r.v.'s distributed $N(0, 1)$ and let $X = (\sum_{i=1}^{n} Z_i^{2})^{1/2}$; then it can be shown by the multivariate transformation method that the vector

$$Y = \left( \frac{Z_1}{X}, \ldots, \frac{Z_n}{X} \right) \tag{3.6.76}$$

is distributed uniformly on the $n$-dimensional unit sphere.*

Now taking into account that $X^{2} = \sum_{i=1}^{n} Z_i^{2}$ has the chi-squared distribution with $n$ degrees of freedom (see Section 3.6.8), the algorithm for generating from $N(0, I)$, where $I$ is a unit matrix of size $n$, is as follows.

Algorithm N-4

1 Generate a random vector $Y = (Y_1, \ldots, Y_n)$ uniformly distributed on the $n$-dimensional unit sphere.

2 Generate a chi-square distributed random variate $X^{2}$ with $n$ degrees of freedom.

3 $Z_k \leftarrow X Y_k$, $k = 1, \ldots, n$.

4 Deliver $Z = (Z_1, \ldots, Z_n)$.

Since the efficiency of the algorithm for generating $Y = (Y_1, \ldots, Y_n)$ (see example 5, Section 3.4.2) decreases when $n$ increases, it would be interesting to find the optimal $n$ in order to minimize the CPU time while sampling from $N(0, I)$.

*An alternative algorithm for generating a vector uniformly distributed on the n-dimensional unit sphere is given in example 5, Section 4.3.2.

Procedure N-5

This procedure relies on the central limit theorem, which says that if $X_i$, $i = 1, \ldots, n$, are i.i.d. r.v.'s with $E(X_i) = \mu$ and $\operatorname{var}(X_i) = \sigma^{2}$, then

$$Z = \frac{\sum_{i=1}^{n} X_i - n\mu}{n^{1/2} \sigma} \tag{3.6.77}$$

converges asymptotically with $n$ to $N(0, 1)$. Consider the particular case when all $X_i$, $i = 1, \ldots, n$, are from $\mathcal{U}(0, 1)$. We find that $\mu = 1/2$, $\sigma = 1/\sqrt{12}$, and

$$Z = \frac{\sum_{i=1}^{n} U_i - n/2}{(n/12)^{1/2}}. \tag{3.6.78}$$

A good approximation can already be obtained for $n = 12$. In this case

$$Z = \sum_{i=1}^{12} U_i - 6. \tag{3.6.79}$$

Algorithm N-5 is straightforward.

Algorithm N-5

1 Generate 12 uniformly distributed random variates $U_1, \ldots, U_{12}$ from $\mathcal{U}(0, 1)$.

2 $Z \leftarrow \sum_{i=1}^{12} U_i - 6$.

3 Deliver $Z$.
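
Formula (3.6.79) gives a one-line generator, sketched below in Python under a hypothetical name. Note that the result is only approximately normal: by construction it can never fall outside the interval $[-6, 6]$.

```python
import random


def normal_clt12(rng=random):
    """Sketch of Algorithm N-5: approximate N(0, 1) as a sum of 12 uniforms minus 6."""
    return sum(rng.random() for _ in range(12)) - 6.0
```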


Procedure N-6 

Another approximation technique for generating from $N(0, 1)$ is given in Tocher [33]; it makes use of the following approximation:

$$\sqrt{\frac{2}{\pi}} \, e^{-x^{2}/2} \approx \frac{2k e^{-kx}}{(1 + e^{-kx})^{2}} \tag{3.6.80}$$

for $x \ge 0$ and $k = \sqrt{8/\pi}$.

The c.d.f. for the approximation is

$$F_X(x) = \frac{2}{1 + e^{-kx}} - 1.$$

The inverse transformation is

$$X = \frac{1}{k} \ln \frac{1 + U}{1 - U}. \tag{3.6.81}$$

Attaching a random sign to this variate we obtain the desired variate $Z = \pm X$.

Algorithm N-6

1 Generate $U_1$ and $U_2$ from $\mathcal{U}(0, 1)$.

2 $X \leftarrow \sqrt{\pi/8} \, \ln[(1 + U_1)/(1 - U_1)]$.

3 If $U_2 \le 0.5$, deliver $Z = -X$.

4 Deliver $Z = X$.


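3.6.5 Lognormal Distribution

Let $X$ be from $N(\mu, \sigma^{2})$. Then $Y = e^{X}$ has the lognormal distribution with p.d.f.

$$f_Y(y) = \begin{cases} \dfrac{1}{\sigma y \sqrt{2\pi}} \exp \left[ - \dfrac{(\ln y - \mu)^{2}}{2 \sigma^{2}} \right], & 0 < y < \infty \\[2mm] 0, & \text{otherwise.} \end{cases} \tag{3.6.82}$$

Algorithm LN-1

1 Generate $Z$ from $N(0, 1)$.

2 $X \leftarrow \mu + \sigma Z$.

3 $Y \leftarrow e^{X}$.

4 Deliver $Y$.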


3.6.6 Cauchy Distribution 

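An r.v. $X$ has a Cauchy distribution denoted by $C(\alpha, \beta)$ if the p.d.f. is equal to

$$f_X(x) = \frac{\beta}{\pi [\beta^{2} + (x - \alpha)^{2}]}, \qquad \alpha > 0, \; \beta > 0, \; -\infty < x < \infty \tag{3.6.83}$$

and the c.d.f. is equal to

$$F_X(x) = \frac{1}{2} + \frac{1}{\pi} \tan^{-1} \left( \frac{x - \alpha}{\beta} \right). \tag{3.6.84}$$

Applying the inverse transform method, we obtain

$$X = F^{-1}(U) = \alpha + \beta \tan \left[ \pi \left( U - \tfrac{1}{2} \right) \right] = \alpha - \frac{\beta}{\tan(\pi U)}. \tag{3.6.85}$$

Algorithm C-1 describes the necessary steps.

Algorithm C-1

1 Generate $U$ from $\mathcal{U}(0, 1)$.

2 $X \leftarrow \alpha - \beta / \tan(\pi U)$.

3 Deliver $X$.

The next algorithm is based on the following two properties:

(a) If $Z_1$ and $Z_2$ are independent variates from $N(0, 1)$, then $Y = Z_1 / Z_2$ is from $C(0, 1)$.

(b) If $X$ is from $C(0, 1)$, then $Y = \beta X + \alpha$ is from $C(\alpha, \beta)$. The last property can be obtained directly from the transformation formula (3.5.9)

$$f_Y(y) = f_X(x) \left| \frac{dx}{dy} \right|, \qquad x = \frac{y - \alpha}{\beta}.$$

Algorithm C-2

1 Generate $Z_1$ and $Z_2$ from $N(0, 1)$.

2 $X \leftarrow \beta Z_1 / Z_2 + \alpha$.

3 Deliver $X$.

The third algorithm is based on the following property [18]:

(c) If $Y_1$ and $Y_2$ are independent r.v.'s both from $\mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2})$ and $Y_1^{2} + Y_2^{2} \le \tfrac{1}{4}$, then $X = Y_1 / Y_2$ is from $C(0, 1)$.

Algorithm C-3

1 Generate $U_1$ and $U_2$ from $\mathcal{U}(0, 1)$.

2 $Y_1 \leftarrow U_1 - \tfrac{1}{2}$ and $Y_2 \leftarrow U_2 - \tfrac{1}{2}$.

3 If $Y_1^{2} + Y_2^{2} > \tfrac{1}{4}$, go to step 1.

4 $X \leftarrow \beta Y_1 / Y_2 + \alpha$.

5 Deliver $X$.

The efficiency of the algorithm is

$$P(Y_1^{2} + Y_2^{2} \le \tfrac{1}{4}) = \frac{\pi}{4},$$

so the algorithm is relatively efficient.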
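
Two of the Cauchy generators above, the inverse transform (Algorithm C-1) and the ratio method on a disc (Algorithm C-3), can be compared with the Python sketch below. Function names are hypothetical, and the guards against a zero argument or zero denominator are additions for numerical safety.

```python
import math
import random


def cauchy_inverse(alpha, beta, rng=random):
    """Sketch of Algorithm C-1: X = alpha - beta / tan(pi * U)."""
    u = rng.random()
    while u == 0.0:                      # tan(0) = 0 would divide by zero
        u = rng.random()
    return alpha - beta / math.tan(math.pi * u)


def cauchy_ratio(alpha, beta, rng=random):
    """Sketch of Algorithm C-3: ratio of coordinates of a point in a disc."""
    while True:
        y1 = rng.random() - 0.5
        y2 = rng.random() - 0.5
        if y2 != 0.0 and y1 * y1 + y2 * y2 <= 0.25:
            return beta * y1 / y2 + alpha
```

The ratio method avoids the tangent evaluation at the cost of rejecting about $1 - \pi/4 \approx 21\%$ of the candidate pairs.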

3.6.7 Weibull Distribution

An r.v. has a Weibull distribution if the p.d.f. is equal to

$$f_X(x) = \begin{cases} \dfrac{\alpha}{\beta^{\alpha}} x^{\alpha - 1} e^{-(x/\beta)^{\alpha}}, & 0 \le x < \infty, \; \alpha > 0, \; \beta > 0 \\[2mm] 0, & \text{otherwise} \end{cases} \tag{3.6.86}$$

and is denoted by $W(\alpha, \beta)$.

To generate $X$ by the inverse transformation method note that

$$F_X(x) = 1 - e^{-(x/\beta)^{\alpha}}, \tag{3.6.87}$$

so

$$X = \beta (-\ln(1 - U))^{1/\alpha}. \tag{3.6.88}$$

Since $1 - U$ is also from $\mathcal{U}(0, 1)$, we have

$$X = \beta (-\ln U)^{1/\alpha} \tag{3.6.89}$$

or

$$\left( \frac{X}{\beta} \right)^{\alpha} = -\ln U. \tag{3.6.90}$$

Taking into account that $-\ln(U)$ is from exp(1), the algorithm for generating an r.v. from a Weibull distribution can be written as follows.

Algorithm W-1

1 Generate $V$ from exp(1).

2 $X \leftarrow \beta V^{1/\alpha}$.

3 Deliver $X$.
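
Algorithm W-1 amounts to a single transformation of an exponential variate, as the following Python sketch shows (the function name is hypothetical).

```python
import random


def weibull(alpha, beta, rng=random):
    """Sketch of Algorithm W-1: W(alpha, beta) with shape alpha and scale beta."""
    v = rng.expovariate(1.0)          # V from exp(1)
    return beta * v ** (1.0 / alpha)
```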


3.6.8 Chi-Square Distribution 

Let $Z_1, \ldots, Z_k$ be from $N(0, 1)$. Then

$$Y = \sum_{i=1}^{k} Z_i^{2} \tag{3.6.91}$$

has the chi-square distribution with $k$ degrees of freedom and is denoted $\chi^{2}(k)$.

Formula (3.6.91) says, "the sum of the squares of independent standard normal random variables has a chi-square distribution with degrees of freedom equal to the number of terms in the sum". One approach for generating a chi-square variate from $\chi^{2}(k)$ is to generate $k$ standard normal random variables and then apply (3.6.91).

Another approach makes use of the fact that $\chi^{2}(k)$ is a particular case of a gamma density with gamma parameters α and β equal, respectively, to $k/2$ and 2.

Consider two cases.

CASE 1 If $k$ is even, then $Y$ can be computed as

$$Y = -2 \ln \left( \prod_{i=1}^{k/2} U_i \right). \tag{3.6.92}$$

Formula (3.6.92) requires $k/2$ uniform variates compared to $k$ in (3.6.91). It also requires one logarithmic transformation, compared to $k$ logarithmic and $k$ cosine or sine transformations for generating $Z_i$ from $N(0, 1)$, $i = 1, \ldots, k$ (see (3.6.63) and (3.6.64)).

CASE 2 If $k$ is odd, then

$$Y = -2 \ln \left( \prod_{i=1}^{k/2 - 1/2} U_i \right) + Z^{2}, \tag{3.6.93}$$

where $Z$ is from $N(0, 1)$ and $U_i$ is from $\mathcal{U}(0, 1)$.

For $k > 30$ the normal approximation for chi-square variates can be used based on the following formula [24]:

$$Z = \sqrt{2Y} - \sqrt{2k - 1}.$$

Solving for $Y$, the chi-square variate, we obtain

$$Y = \frac{(Z + \sqrt{2k - 1})^{2}}{2}. \tag{3.6.94}$$

Remark Let $Y_1$, $Y_2$, and $Y_3$ be chi-square random variables with degrees of freedom $2(\alpha + \beta)$, $2\alpha$, and $2\beta$, respectively; then


has a beta density with parameters a and fi. Applying formula (3.6.92), we 
get 

tt ,_ 

+ + r ' ’ ^2 (a+fi) 


3.6.9 Student’s t Distribution 

Let Z have a standard normal distribution, let Y have a chi-square 
distribution with k degrees of freedom, and let Z and Y be independent; 
then 

$$X = \frac{Z}{\sqrt{Y/k}} \tag{3.6.95}$$

has a Student’s t distribution with k degrees of freedom. To generate X we 
simply generate Z as described in Section 3.6.4 and Y as described in 
Section 3.6.8 and apply (3.6.95). For k > 30 the normal approximation can 
be used. 

3.6.10 F Distribution 

Let $Y_1$ be a chi-square random variable with $k_1$ degrees of freedom; let $Y_2$ be a chi-square random variable with $k_2$ degrees of freedom, and let $Y_1$ and $Y_2$ be independent. Then the random variable

$$X = \frac{Y_1 / k_1}{Y_2 / k_2} \tag{3.6.96}$$

is distributed as an $F$ distribution with $k_1$ and $k_2$ degrees of freedom. To generate an $F$ variate we first produce two chi-square variates and then use (3.6.96).

Remark 1. If $X$ has an $F$ distribution with $k_1$ and $k_2$ degrees of freedom, then $1/X$ has an $F$ distribution with $k_2$ and $k_1$ degrees of freedom.

Remark 2. If $X$ is an $F$-distributed random variable with $k_1$ and $k_2$ degrees of freedom, then

$$\frac{k_1 X / k_2}{1 + k_1 X / k_2} \tag{3.6.97}$$

has a beta density with parameters $\alpha = k_1 / 2$ and $\beta = k_2 / 2$.


3.7 GENERATING FROM DISCRETE DISTRIBUTIONS

In this section we describe several procedures for generating stochastic variates from most of the well known discrete distributions. We start with the inverse transform method, which is generally easily implemented and is widely used.

Let $X$ be a discrete r.v. with probability mass function (p.m.f.)

$$\Pr(X = x_k) = P_k, \qquad k = 0, 1, \ldots \tag{3.7.1}$$

and with c.d.f.

$$g_k = \Pr(X \le x_k) = \sum_{i=0}^{k} P_i. \tag{3.7.2}$$

Then

$$\Pr(g_{k-1} < U \le g_k) = \int_{g_{k-1}}^{g_k} du = P_k, \qquad g_{-1} = 0, \tag{3.7.3}$$

where $U$ is from $\mathcal{U}(0, 1)$. Thus

$$X = \min \{ x_k : g_{k-1} < U \le g_k \}. \tag{3.7.4}$$

Algorithm IT-2, which is called the inverse transform algorithm, describes generating discrete r.v.'s. This algorithm is based on logical comparison of $U$ with the $g_k$'s and is as follows.





Algorithm IT-2 

1 $C \leftarrow P_0$.

2 $B \leftarrow C$.

3 $K \leftarrow 0$.

4 Generate $U$ from $\mathcal{U}(0, 1)$.

5 If $U \le B$ ($U \le g_k$), deliver $X = x_k$.

6 $K \leftarrow K + 1$.

7 $C \leftarrow A_{k+1} C$ ($P_{k+1} = A_{k+1} P_k$).

8 $B \leftarrow B + C$ ($g_{k+1} = g_k + P_{k+1}$).

9 Go to step 5.

Here $P_0$ and $A_{k+1} = P_{k+1} / P_k$ are distribution dependent. The recurrent formulas

$$P_{k+1} = A_{k+1} P_k \tag{3.7.5}$$

$$g_{k+1} = g_k + P_{k+1} \tag{3.7.6}$$

in steps 7 and 8 are straightforward for calculation.
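
The sequential search of Algorithm IT-2 can be sketched generically in Python: the caller supplies $P_0$ and a function returning the ratio $A_{k+1} = P_{k+1}/P_k$. The interface below (`p0`, `ratio`) is an assumption made for this illustration, as is the Poisson example with $P_0 = e^{-\lambda}$ and $A_{k+1} = \lambda/(k + 1)$.

```python
import math
import random


def inverse_transform_discrete(p0, ratio, rng=random):
    """Sketch of Algorithm IT-2 for a nonnegative integer valued r.v.
    ratio(k) must return A_{k+1} = P_{k+1} / P_k (hypothetical interface)."""
    c = p0            # current probability P_k
    b = c             # current c.d.f. value g_k
    k = 0
    u = rng.random()
    while u > b:
        c *= ratio(k)             # P_{k+1} = A_{k+1} P_k
        if c <= 0.0:
            break                 # tail underflowed; stop the search
        b += c                    # g_{k+1} = g_k + P_{k+1}
        k += 1
    return k


# Usage: Poisson with parameter lam = 3, where A_{k+1} = lam / (k + 1).
lam = 3.0
sample = inverse_transform_discrete(math.exp(-lam), lambda k: lam / (k + 1))
```

The number of passes through the loop is the delivered value itself, which is why the mean number of comparisons is one more than the expected value of the variate.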

Most discrete r.v.'s are nonnegative integer valued, that is, $x_k = k$, $k = 0, 1, \ldots$. Later, we consider only these r.v.'s. It is easy to see that the mean number of trials

$$C = \sum_{k=0}^{\infty} (k + 1) P_k = 1 + E(X) \tag{3.7.7}$$

is equal to the expected value plus one additional trial.

Table 3.7.1 represents the values of $P_0$, $A_{k+1}$, and $C$ for most well known discrete distributions.

In order to generate an r.v. from a specified discrete distribution, we take the corresponding values $P_0$ and $A_{k+1}$ from Table 3.7.1 and then run Algorithm IT-2.

In many cases we can improve the efficiency of the inverse transform method IT-2 by starting the search of $X$ at $k = m$, $m$ being an interior point (for example, mode, median, etc.), rather than at $k = 0$. We assume that tables of $P_k$ and $g_k$ are available.

The procedure is as follows. If $U > g_m$, then

$$g_{m+i} = g_{m+i-1} + P_{m+i} \tag{3.7.8}$$

$$P_{m+i} = P_{m+i-1} A_{m+i}, \qquad i = 1, 2, \ldots \tag{3.7.9}$$

If $U \le g_m$, then

$$g_{m-i} = g_{m-i+1} - P_{m-i+1} \tag{3.7.10}$$

$$P_{m-i} = P_{m-i+1} A'_{m-i}, \qquad i = 1, \ldots, m \tag{3.7.11}$$

where $A_k$ and $A'_k$ are distribution dependent and their values are available to compute.

Table 3.7.1 Discrete Distributions (table body not recoverable)

Algorithm IT-3 describes the necessary steps. 

Algorithm IT-3 

1 $D \leftarrow g_m$.

2 $E \leftarrow P_m$.

3 Generate $U$ from $\mathcal{U}(0, 1)$.

4 $K \leftarrow m$.

5 If $U > g_k$, go to step 12.

6 $D \leftarrow D - E$ ($g_{k-1} = g_k - P_k$).

7 If $U > D$, deliver $X = K$; go to …

8 $K \leftarrow K - 1$.

9 If $K = 0$, deliver $X = K$; go to …

10 $E \leftarrow E A'_{k-1}$ ($P_{k-1} = A'_{k-1} P_k$).

11 Go to step 6.

12 $K \leftarrow K + 1$.

13 $E \leftarrow E A_{k+1}$ ($P_{k+1} = A_{k+1} P_k$).

14 $D \leftarrow D + E$.

15 If $U \le D$, deliver $X = K$.

16 Go to step 12.

Table 3.7.2 represents the values of $P_0$, $m$ (mode), $A'_{k+1}$, and $A_{k+1}$ for most well known discrete distributions.

It is easy to see that for an integer $m$ the number of trials (number of logical comparisons of $U$ with the $g_k$'s) is the following r.v.:

$$\eta = \begin{cases} 2 + (m - X), & \text{if } X = 0, 1, \ldots, m \\ 1 + (X - m), & \text{if } X = m + 1, m + 2, \ldots \end{cases} \tag{3.7.12}$$

The mean number of trials is

$$\begin{aligned} C = E(\eta) &= \sum_{x=0}^{m} [2 + (m - x)] P_x + \sum_{x=m+1}^{\infty} [1 + (x - m)] P_x \\ &= \sum_{x=0}^{m} P_x + \sum_{x=0}^{m} P_x + \sum_{x=m+1}^{\infty} P_x + m \sum_{x=0}^{m} P_x - m \sum_{x=m+1}^{\infty} P_x - \sum_{x=0}^{m} x P_x + \sum_{x=m+1}^{\infty} x P_x \\ &= g_m + 1 + m g_m - m(1 - g_m) - 2 \sum_{x=0}^{m} x P_x + E(X) \\ &= 1 + E(X) - \gamma(m), \end{aligned} \tag{3.7.13}$$

where

$$\gamma(m) = 2 \sum_{x=0}^{m} x P_x - g_m + m - 2 m g_m. \tag{3.7.14}$$

Table 3.7.2 (table body not recoverable)

It follows from (3.7.7) and (3.7.13) that Algorithm IT-3 is more efficient than Algorithm IT-2 for $m$ such that $\gamma(m) > 0$. However, $\gamma(m)$ is not necessarily positive for each $m$.

The following example illustrates this point.

Example 1 Assume that the r.v. $X$ has the following p.m.f.:

$$P_x = \begin{cases} \tfrac{1}{4}, & x = 0 \\ \tfrac{1}{2}, & x = 1 \\ \tfrac{1}{4}, & x = 2 \\ 0, & \text{otherwise.} \end{cases}$$

Let $m = 1$; then $\gamma(1) = 2 \sum_{x=0}^{1} x P_x - g_1 + 1 - 2 g_1 = 1 - \tfrac{3}{4} + 1 - \tfrac{3}{2} = -0.25 < 0$, and therefore Algorithm IT-2 is more efficient than Algorithm IT-3.


Nevertheless, in many cases it is possible to choose the starting point $m$ in such a way that $\gamma(m) > 0$, and therefore it is possible for IT-3 to be more efficient than IT-2.

Lemma 3.7.1 If there exists $m > 0$ such that

$$P_0 \le \sum_{x=1}^{m} (2x - 1) P_x \qquad \text{for } g_m \le \tfrac{1}{2}, \tag{3.7.15}$$

then $\gamma(m) \ge 0$.

Proof Condition $P_0 \le \sum_{x=1}^{m} (2x - 1) P_x$ is equivalent to

$$2 \sum_{x=0}^{m} x P_x - g_m \ge 0 \tag{3.7.16}$$

and, correspondingly, condition $g_m \le \tfrac{1}{2}$, $m > 0$, is equivalent to

$$m - 2 m g_m \ge 0. \tag{3.7.17}$$

Both (3.7.16) and (3.7.17) yield $\gamma(m) \ge 0$. Q.E.D.

Note 1 We can see that Lemma 3.7.1 is valid if $P_0 \le$ … This condition is not restrictive and holds for practically all discrete distributions.

Lemma 3.7.2 $\gamma(m)$ achieves its maximum at points $m_0$ or $m_0 + 1$, where $m_0 = \max \{ m : g_m \le \tfrac{1}{2} \}$, depending, correspondingly, on whether $g_{m_0} + g_{m_0 + 1} \ge 1$ or $g_{m_0} + g_{m_0 + 1} < 1$.

Proof It is straightforward to obtain from (3.7.14) that

$$\Delta \gamma(m) = \gamma(m + 1) - \gamma(m) = 1 - g_m - g_{m+1}. \tag{3.7.18}$$

For $m < m_0$ we have $g_m + g_{m+1} \le 1$, and therefore $\Delta \gamma(m) \ge 0$; for $m > m_0$ we have, correspondingly, $g_m + g_{m+1} > 1$ and $\Delta \gamma(m) < 0$. Therefore $\gamma(m)$ is a unimodal function with the maximum at points $m_0$ or $m_0 + 1$, depending on whether $\Delta \gamma(m_0) \le 0$, that is, $g_{m_0} + g_{m_0 + 1} \ge 1$, or $\Delta \gamma(m_0) > 0$, that is, $g_{m_0} + g_{m_0 + 1} < 1$. Q.E.D.

Note 2 In other words, Lemma 3.7.2 says that $\gamma(m)$ achieves its maximum at the median or at a point neighboring the median on the left.

As a corollary from these two lemmas we obtain the following theorem. 

Theorem 3.7.1 The optimal starting point in Algorithm IT-3 is either the median $m_0 = \max \{ m : g_m \le \tfrac{1}{2} \}$, if $P_0 \le \sum_{x=1}^{m_0} (2x - 1) P_x$ and $g_{m_0} + g_{m_0 + 1} \ge 1$, or $m_0 + 1$, if $P_0 \le \sum_{x=1}^{m_0 + 1} (2x - 1) P_x$ and $g_{m_0} + g_{m_0 + 1} < 1$.

Note 3 Theorem 3.7.1 is valid not only for integer nonnegative valued r.v.'s, but for any discrete r.v. with values $x_0, x_1, \ldots$, since Algorithm IT-3 is determined not by the sequence $x_0, x_1, \ldots$, but by its indices $0, 1, \ldots$.

In the rest of this chapter we consider some alternative procedures for generating discrete r.v.'s. Generally, procedures for generating discrete variates are simpler than procedures for generating continuous variates, and we describe them only briefly.

3.7.1 Binomial Distribution 

An r.v. X has a binomial distribution if the p.m.f. is equal to 

P, = ( n x ) P *(l- P y-\ * = 0 n (3.7.19) 

and is denoted by B(n,p). Here 0 < p < 1 is the probability of success in a 
single trial, and n is the number of trials. 

To apply the inverse transform method IT-2 we must check the follow- 
ing condition after step 5: if K=*n- 1, terminate the procedure with 
X *= K — n. 
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It is also worthwhile to note that, if Y is from B( n,p) y then n - Y is from 
B(n , 1 -p). Hence for purposes of efficiency we generate X from B(n,p) 
according to 


f Y — B(n,p)iip 
I Y — B(n, \ - p)ifp>\. 


(3.7.20) 


For larger n the inverse-transform procedure becomes time consuming,
and we can consider the normal distribution as an approximation to the
binomial.

As n increases the distribution of

$$Z = \frac{X - np + \tfrac12}{[np(1 - p)]^{1/2}} \tag{3.7.21}$$

approaches N(0, 1).

To obtain a binomial variate we generate Z from N(0, 1), solve (3.7.21)
with respect to X, and round to a nonnegative integer, that is,

$$X = \max\{0, [-0.5 + np + Z(np(1 - p))^{1/2}]\}, \tag{3.7.22}$$

where [a] denotes the integer part of a.

We should consider replacing the binomial with the approximate normal
when np > 10 for p > 1/2 and n(1 - p) > 10 for p < 1/2.
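The two binomial generators described above can be sketched as follows. This is a minimal illustration, not the book's code: the function names are hypothetical, Python's standard `random` module stands in for the uniform and normal generators, and the recurrence used to update the probabilities is an assumption about how the inverse transform steps are carried out.

```python
import math
import random

def binomial_inverse(n, p, rng):
    """Sample B(n, p) by inverse transform, reflecting when p > 1/2 (3.7.20)."""
    if p > 0.5:
        return n - binomial_inverse(n, 1.0 - p, rng)
    u = rng.random()
    prob = (1.0 - p) ** n              # P_0
    cdf = prob
    k = 0
    while u > cdf and k < n:
        # P_{k+1} = P_k * (n - k) / (k + 1) * p / (1 - p)
        prob *= (n - k) / (k + 1) * p / (1.0 - p)
        cdf += prob
        k += 1
    return k

def binomial_normal(n, p, rng):
    """Normal approximation (3.7.22), with [a] taken as the integer part."""
    z = rng.gauss(0.0, 1.0)
    value = -0.5 + n * p + z * math.sqrt(n * p * (1.0 - p))
    return max(0, math.floor(value))
```

Because (3.7.22) takes the integer part after subtracting 0.5, the sample mean of `binomial_normal` sits roughly one unit below np; for large n this shift is small relative to the standard deviation.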

It is shown [22] that, if m is the mode, then for large n the mean number
of trials in Algorithm IT-3 is equal to

$$C = 1.5 + \sqrt{\frac{2}{\pi}}\,\sqrt{np(1 - p)}. \tag{3.7.23}$$

Comparing Algorithms IT-2 and IT-3 (compare (3.7.7) with
(3.7.23)), we can see that for large n the mean number of trials is
proportional to $np$ and $\sqrt{np(1 - p)}$, respectively.

So for large n Algorithm IT-3 is essentially more efficient than
Algorithm IT-2.

The acceptance-rejection method can also successfully be implemented
for generating from B(n, p) (see Ahrens and Dieter [4] and Marsaglia [20]).
A description of algorithms for this and their efficiency can be found in
Fishman's monograph [12].


3.7.2 Poisson Distribution 

An r.v. X has a Poisson distribution if the p.m.f. is equal to

$$P_x = \frac{e^{-\lambda}\lambda^x}{x!}, \qquad x = 0, 1, \ldots;\ \lambda > 0, \tag{3.7.24}$$

and is denoted by $P(\lambda)$.

It is well known (Feller [11]) that, if the time intervals between events
are from $\exp(1/\lambda)$, the number of events occurring in a unit interval of
time is from $P(\lambda)$.

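Mathematically, it can be written

$$\sum_{i=0}^{X} T_i \le 1 < \sum_{i=0}^{X+1} T_i, \tag{3.7.25}$$

where $T_i$, $i = 0, 1, \ldots, X + 1$, are from $\exp(1/\lambda)$.

Since $T_i = -(1/\lambda)\ln U_i$, the last formula can be written as

$$\sum_{i=0}^{X} \ln U_i \ge -\lambda > \sum_{i=0}^{X+1} \ln U_i, \qquad X = 0, 1, \ldots, \tag{3.7.26}$$

or

$$\prod_{i=0}^{X} U_i \ge e^{-\lambda} > \prod_{i=0}^{X+1} U_i, \qquad X = 0, 1, \ldots. \tag{3.7.27}$$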

The following algorithm is written with respect to (3.7.25):

1 $A \leftarrow 1$ ($g_0 = 1$).

2 $K \leftarrow 0$.

3 Generate $U_K$ from $\mathcal{U}(0, 1)$.

4 $A \leftarrow U_K A$ ($g_{K+1} = g_K U_K$).

5 If $A < e^{-\lambda}$, deliver $X = K$.

6 $K \leftarrow K + 1$.

7 Go to step 3.
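The seven steps above translate almost line for line into code. The sketch below assumes Python's `random` module as the source of uniform numbers; the function name `poisson_product` is hypothetical. It is intended for small or moderate $\lambda$, since $e^{-\lambda}$ underflows for very large $\lambda$.

```python
import math
import random

def poisson_product(lam, rng):
    """Multiply uniforms until the running product drops below exp(-lam)."""
    threshold = math.exp(-lam)
    a = 1.0                      # step 1
    k = 0                        # step 2
    while True:
        a *= rng.random()        # steps 3 and 4
        if a < threshold:        # step 5
            return k
        k += 1                   # steps 6 and 7
```

The expected number of uniforms consumed per variate is $\lambda + 1$, which is why the normal approximation becomes attractive for large $\lambda$.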

For large $\lambda$ ($\lambda > 10$) we can approximate the Poisson distribution by the
normal distribution. As $\lambda$ increases, the distribution of

$$Z = \frac{X - \lambda + \tfrac12}{\lambda^{1/2}} \tag{3.7.28}$$

approaches N(0, 1).

To obtain a Poisson variate we generate Z from N(0, 1), then by analogy
with (3.7.22) we obtain

$$X = \max(0, [\lambda + Z\lambda^{1/2} - 0.5]), \tag{3.7.29}$$

where [a] is the integer part of a.

It is shown in Ref. 22 that, if m is the mode, then for large $\lambda$ the mean
execution time in Algorithm IT-3 is similar to (3.7.23) and is equal to

$$C = 1.5 + \sqrt{\frac{2}{\pi}}\,\lambda^{1/2}. \tag{3.7.30}$$

The mean numbers of trials in Algorithms IT-2 and IT-3 are
proportional, respectively, to $\lambda$ and $\lambda^{1/2}$, and therefore Algorithm IT-3 is
again essentially more efficient than Algorithm IT-2.


3.7.3 Geometric Distribution


An r.v. X has the geometric distribution if the p.m.f. is equal to

$$P_x = p(1 - p)^x, \qquad x = 0, 1, \ldots,\ 0 < p < 1, \tag{3.7.31}$$

and is denoted by Ge(p). The geometric distribution describes the number of
trials to the first success in a series of Bernoulli trials.

The following procedure describes generating from Ge(p) and is based
on the relationship between the exponential and geometric distributions. Let Y
be from $\exp(\beta)$; then

$$\Pr(x \le Y < x + 1) = \frac{1}{\beta}\int_x^{x+1} e^{-y/\beta}\,dy = e^{-x/\beta}\,(1 - e^{-1/\beta}), \qquad x = 0, 1, \ldots, \tag{3.7.32}$$

which is $Ge(p = 1 - e^{-1/\beta})$.

For $\beta = -1/\ln(1 - p)$, (3.7.32) is identical to (3.7.31). Therefore

$$X = \left[\frac{\ln U}{\ln(1 - p)}\right] = \left[\frac{-V}{\ln(1 - p)}\right], \tag{3.7.33}$$

where $V = -\ln U$ is a standard exponential variate, that is, X is from
Ge(p). Hence to generate an r.v. from Ge(p) we generate an r.v. from the
exponential distribution with $\beta = -1/\ln(1 - p)$ and round the value down to an
integer.

CPU time for this procedure is constant, whereas the CPU time for the
inverse transform method is proportional to 1/p. However, because this
procedure requires generation from the exponential distribution and
rounding, it is more efficient than Algorithm IT-2 only for p < 0.25.
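A minimal sketch of this procedure, assuming Python's `random` module as the uniform generator (the function name `geometric` is hypothetical): one logarithm of a uniform number is divided by $\ln(1 - p)$ and the integer part is taken.

```python
import math
import random

def geometric(p, rng):
    """Integer part of ln(U) / ln(1 - p), which is Ge(p) on 0, 1, 2, ..."""
    u = 1.0 - rng.random()          # u lies in (0, 1], so log(u) is defined
    return int(math.log(u) / math.log(1.0 - p))
```

Since $\Pr(X \ge k) = \Pr(U \le (1 - p)^k) = (1 - p)^k$, the returned value has the p.m.f. $p(1 - p)^x$.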


3.7.4 Negative Binomial Distribution 

The p.m.f. for the negative binomial distribution is

$$P_x = \binom{x + r - 1}{x} p^r (1 - p)^x, \qquad x = 0, 1, \ldots;\ p > 0, \tag{3.7.34}$$

and is denoted by NB(r, p). When r is an integer the distribution is called the
Pascal distribution, which describes the number of successes occurring
before the rth failure in a series of Bernoulli trials. This implies that the
geometric distribution is a special case of the Pascal distribution with r = 1.

The following algorithm describes generating from the Pascal distribution
with parameters r and p, denoted PS(r, p).

1 $X \leftarrow 0$.

2 $Y \leftarrow 0$.

3 Generate $U_{X+Y}$ from $\mathcal{U}(0, 1)$.

4 If $U_{X+Y} > p$, go to step 8.

5 $Y \leftarrow Y + 1$.

6 If $Y = r$, deliver X.

7 Go to step 3.

8 $X \leftarrow X + 1$.

9 Go to step 3.

An alternative procedure is based on the reproductive property of the
negative binomial distribution analogous to that for the gamma distribution.
Let $X_i$, $i = 1, \ldots, n$, denote a sequence of independent r.v.'s from $NB(r_i, p)$.

Then $X = \sum_{i=1}^{n} X_i$ is from $NB(r, p)$, where $r = \sum_{i=1}^{n} r_i$.

Suppose that $r_i = 1$, $i = 1, \ldots, r$, which means that the $X_i$ are
from Ge(p); then $X = \sum_{i=1}^{r} X_i$ is from NB(r, p).

The algorithm is straightforward and contains the following steps:

1 Generate $X_1, \ldots, X_r$ from Ge(p).

2 $X \leftarrow \sum_{i=1}^{r} X_i$.

3 Deliver X.

This procedure is more efficient than the inverse transform method IT-2
for p > 0.75.
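Both Pascal generators above, the Bernoulli-trials algorithm and the sum of r geometric variates, can be sketched as follows. The function names are hypothetical, and the geometric sampler uses the logarithm method of Section 3.7.3 as an assumed building block.

```python
import math
import random

def pascal_trials(r, p, rng):
    """Count trials with U > p until r trials with U <= p have occurred."""
    x = 0
    y = 0
    while True:
        if rng.random() > p:
            x += 1                  # step 8
        else:
            y += 1                  # step 5
            if y == r:              # step 6
                return x

def geometric(p, rng):
    u = 1.0 - rng.random()          # u lies in (0, 1], so log(u) is defined
    return int(math.log(u) / math.log(1.0 - p))

def pascal_sum(r, p, rng):
    """Sum of r independent Ge(p) variates."""
    return sum(geometric(p, rng) for _ in range(r))
```

Both functions should give sample means close to r(1 - p)/p; the first consumes a random number of uniforms, the second exactly r.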

Another possible method for generating an r.v. from NB(r, p) makes use
of the following relationship (see Johnson and Kotz [18, p. 127]):

$$\Pr(X \le k) = \Pr(Y \ge r), \tag{3.7.35}$$

where X is from NB(r, p) and Y is from B(r + k, p). The reader is asked to
describe an algorithm based on (3.7.35), assuming that an r.v. Y from
B(r + k, p) is given.

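The next procedure is based on the relationship of the negative
binomial distribution with the gamma and Poisson distributions.

Suppose we have a mixture of Poisson distributions, such that the
parameter $\lambda$ of the Poisson distributions

$$P(X = x \mid \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}, \qquad x = 0, 1, \ldots,$$

varies according to $G(\alpha, \beta)$, that is,

$$f_\Lambda(\lambda) = \frac{\lambda^{\alpha - 1} e^{-\lambda/\beta}}{\beta^{\alpha}\Gamma(\alpha)}, \qquad \lambda > 0,\ \alpha > 0,\ \beta > 0. \tag{3.7.36}$$

Then

$$
\begin{aligned}
P(X = x) &= \int_0^\infty P(X = x \mid \lambda) f_\Lambda(\lambda)\,d\lambda \\
&= [\beta^{\alpha}\Gamma(\alpha)]^{-1} \int_0^\infty \lambda^{\alpha - 1} e^{-\lambda/\beta}\,\frac{\lambda^x e^{-\lambda}}{x!}\,d\lambda \\
&= [\beta^{\alpha}\Gamma(\alpha)\,x!]^{-1} \int_0^\infty \lambda^{\alpha + x - 1} \exp[-\lambda(\beta^{-1} + 1)]\,d\lambda \\
&= \frac{\Gamma(\alpha + x)}{\Gamma(\alpha)\,x!} \left(\frac{1}{\beta + 1}\right)^{\alpha} \left(\frac{\beta}{\beta + 1}\right)^{x}.
\end{aligned}
\tag{3.7.37}
$$

So X is from $NB(\alpha, 1/(\beta + 1))$.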

It is obvious that, when $\Lambda$ is from G(r, (1 - p)/p), (3.7.37) is identical to
(3.7.34). The algorithm is as follows:


1 Generate an r.v. $\Lambda$ from G(r, (1 - p)/p).

2 Generate X from $P(\Lambda)$.

3 Deliver X.


It is not difficult to see that an alternative algorithm for generating an r.v.
from NB(r, p) is the following:

1 Generate $\Lambda$ from G(r, 1).

2 Generate X from $P(\Lambda(1 - p)/p)$.

3 Deliver X.
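The gamma-Poisson mixture can be sketched with the standard library alone. The sketch assumes that the gamma scale is (1 - p)/p, the value stated in the sentence that introduces the algorithm and the one for which the mixture mean r(1 - p)/p matches (3.7.34). `random.gammavariate(alpha, beta)` takes a shape and a scale parameter; the function names and the product-method Poisson sampler are illustrative choices, not the book's code.

```python
import math
import random

def poisson_product(lam, rng):
    """Poisson variate by multiplying uniforms (small or moderate lam only)."""
    threshold = math.exp(-lam)
    a, k = 1.0, 0
    while True:
        a *= rng.random()
        if a < threshold:
            return k
        k += 1

def neg_binomial_mixture(r, p, rng):
    """Lambda from G(r, (1 - p)/p), then X from P(Lambda)."""
    lam = rng.gammavariate(r, (1.0 - p) / p)   # shape r, scale (1 - p)/p
    return poisson_product(lam, rng)

def neg_binomial_scaled(r, p, rng):
    """Lambda from G(r, 1), then X from P(Lambda (1 - p)/p)."""
    lam = rng.gammavariate(r, 1.0)
    return poisson_product(lam * (1.0 - p) / p, rng)
```

Unlike the Bernoulli-trials algorithm, this approach does not require r to be an integer.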


3.7.5 Hypergeometric Distribution 


An r.v. X has a hypergeometric distribution if the p.m.f. is equal to

$$P_x = \frac{\binom{n_1}{x}\binom{n - n_1}{m - x}}{\binom{n}{m}}, \qquad \max(0, n_1 + m - n) \le x \le \min(n_1, m), \tag{3.7.38}$$

and is denoted $H(n, m, n_1)$. The hypergeometric distribution describes sampling
without replacement from a finite population. It has three parameters,
n, m, and $n_1$, which have the following meanings: n, the size of the total
population in two classes; m, the size of the sample (m < n) that is taken
from the total population n without replacement; and $n_1$, the size of the
population in the first class ($n - n_1$ is the size of population in the second
class).

Generation from $H(n, m, n_1)$ involves simulating a sampling experiment
without replacement, which is merely a Bernoulli trials method of generating
from B(n, p) with n and p altering (varying) depending, respectively, on
the total number of elements that have been previously drawn from the
total population and the number of the first class elements that have been
drawn.

The original value $n = n_0$ is reduced according to the formula

$$n_i = n_{i-1} - 1, \qquad i = 1, \ldots, m, \tag{3.7.39}$$

when an element in a sample of m is drawn.

Similarly, the value $p = p_0 = n_1/n$, when the ith element in a sample of m
elements is drawn, becomes

$$p_i = \frac{n_{i-1}\,p_{i-1} - S}{n_{i-1} - 1}, \qquad i = 1, \ldots, m, \tag{3.7.40}$$

where S = 1 when the sample element (i - 1) belongs to the first class, and
S = 0 when the sample element (i - 1) belongs to the second class.
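The sampling experiment without replacement can be sketched as below. Instead of updating the probability p directly, the sketch tracks the counts of remaining elements, which gives the same success probability at each draw; the function name and the use of Python's `random` module are assumptions for illustration.

```python
import random

def hypergeometric(n, m, n1, rng):
    """Draw m elements without replacement; count those from the first class."""
    remaining = n        # elements left in the population
    first = n1           # first-class elements left
    x = 0
    for _ in range(m):
        # probability that the next element belongs to the first class
        if rng.random() < first / remaining:
            x += 1
            first -= 1
        remaining -= 1
    return x
```

The cost is one uniform number per sampled element, so the method is practical when the sample size m is moderate.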


EXERCISES 


1 Describe an algorithm for generating from the Laplace (double exponential)
distribution

$$f_X(x) = \frac{1}{2\beta}\exp\left(-\frac{|x|}{\beta}\right), \qquad \beta > 0,\ -\infty < x < \infty,$$

using the inverse transform method.

2 Apply the inverse transform method for generating from extreme value distribu- 
tion 


r «p 






— 00 < jf < oo. 


3 Describe an algorithm for generating from the logistic distribution

$$f_X(x) = \frac{\exp[-(x - \alpha)/\beta]}{\beta\{1 + \exp[-(x - \alpha)/\beta]\}^2}, \qquad -\infty < x < \infty,\ \beta > 0,\ \alpha > 0.$$

4 Consider the triangular random variable with the density function

$$f_X(x) = \begin{cases} 0, & \text{if } x < 2a \text{ or } x \ge 2b, \\ \dfrac{x - 2a}{(b - a)^2}, & \text{if } 2a \le x < a + b, \\ \dfrac{2b - x}{(b - a)^2}, & \text{if } a + b \le x < 2b, \end{cases}$$

and the distribution function

$$F_X(x) = \begin{cases} 0, & \text{if } x < 2a, \\ \dfrac{(x - 2a)^2}{2(b - a)^2}, & \text{if } 2a \le x < a + b, \\ 1 - \dfrac{(2b - x)^2}{2(b - a)^2}, & \text{if } a + b \le x < 2b, \\ 1, & \text{if } x \ge 2b. \end{cases}$$

This random variable can be considered as a sum of two independent random
variables uniformly distributed between a and b. Show that, applying the inverse
transform method, we obtain

$$X = \begin{cases} 2a + (b - a)\sqrt{2U}, & \text{if } 0 \le U \le 0.5, \\ 2b + (a - b)\sqrt{2(1 - U)}, & \text{if } 0.5 < U \le 1. \end{cases}$$

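5 Let

$$f_X(x) = \begin{cases} c_i x, & x_{i-1} \le x < x_i,\ i = 1, \ldots, n, \\ 0, & \text{otherwise,} \end{cases}$$

$x_0 = a$, $x_n = b$, $c_i \ge 0$, $a \ge 0$.

Using the inverse transform method, prove that

$$X = \left[x_{i-1}^2 + \frac{2(U - F_{i-1})}{c_i}\right]^{1/2}, \qquad F_{i-1} \le U < F_i,$$

where $F_i = \sum_{j=1}^{i}\int_{x_{j-1}}^{x_j} c_j x\,dx$. Describe an algorithm for generating from $f_X(x)$.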
6 Let $X_1, \ldots, X_n$ be i.i.d. r.v.'s from $\exp(\lambda)$.

(a) Show that $Y_1 = \min(X_1, \ldots, X_n)$ is distributed $\exp(n\lambda)$.

(b) Describe an algorithm for generating from $Y_1$.

7 Let $U_1, \ldots, U_{\alpha+\beta-1}$ be from $\mathcal{U}(0, 1)$. Prove that the $\alpha$th order statistic $U_{(\alpha)}$ is
from $Be(\alpha, \beta)$.

8 The joint density of the r.v.'s X and Y is of the form $f(u^2 + v^2)$ for all u and v.
Show that their ratio X/Y has a Cauchy density.

9 Describe two alternative algorithms, correspondingly, for examples 4 and 5 of
Section 3.4 by making use of Theorem 3.4.2.

10 Describe algorithms for generating from the following p.d.f.'s:

(a) $f_{XY}(x, y) = c e^{-(x + y)}$, $x > 0$, $y > 0$.

(b) $f_{XY}(x, y) = c x e^{-xy}$, $0 < x < 2$, $y > 0$.

(c) For generating from $N(\mu, \Sigma)$, where $\mu = (1, 2, 3)$, and

$$\Sigma = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 2 & 0 \\ 0 & 0 & 3 \end{pmatrix}.$$

11 Let $Y_1$ and $Y_2$ be i.i.d. r.v.'s from $\mathcal{U}(-\tfrac12, \tfrac12)$. Prove that, if

$$Y_1^2 + Y_2^2 \le \tfrac14,$$

then $Y_1/Y_2$ is from C(0, 1).

12 Let $Y_1, \ldots, Y_n$ be i.i.d. r.v.'s from exp(1) and let $X = \sum_{i=1}^{n} Y_i$. Prove that the
vector

$$\left(\frac{Y_1}{X}, \ldots, \frac{Y_n}{X}\right)$$

is distributed uniformly on the simplex $\sum_{i=1}^{n} y_i = 1$, $0 \le y_i \le 1$, $i = 1, \ldots, n$.

13 Let $Z_1, \ldots, Z_n$ be i.i.d. r.v.'s distributed N(0, 1) and let $X = (\sum_{i=1}^{n} Z_i^2)^{1/2}$. Prove
that the vector

$$Y = (Y_1, \ldots, Y_n) = \left(\frac{Z_1}{X}, \ldots, \frac{Z_n}{X}\right)$$

is distributed uniformly on the sphere $\sum_{i=1}^{n} y_i^2 = 1$.

14 Let $U_{(1)}, \ldots, U_{(n)}$ be order statistics from $\mathcal{U}(0, 1)$. Prove that the vector

$$X_1 = U_{(1)},\quad X_2 = U_{(2)} - U_{(1)},\ \ldots,\ X_n = U_{(n)} - U_{(n-1)}$$

is distributed uniformly inside the simplex $\sum_{i=1}^{n} x_i \le 1$, $x_i \ge 0$.

15 Consider the p.d.f. 


Let 



x > 0. 


- f3e x > 0, fl > 0 


Using (3.4.10), prove that the maximum efficiency is achieved when /?» 1. 

16 Describe an algorithm for generating from $Be(\alpha, \beta)$, making use of the inequality

$$x^{\alpha - 1}(1 - x)^{\beta - 1} \le x^{\alpha - 1} + (1 - x)^{\beta - 1}$$

and assuming

$$h(x) = \frac{\alpha\beta}{\alpha + \beta}\left[x^{\alpha - 1} + (1 - x)^{\beta - 1}\right], \qquad 0 \le x \le 1,\ \beta > 0,$$

$$g(x) = \frac{x^{\alpha - 1}(1 - x)^{\beta - 1}}{x^{\alpha - 1} + (1 - x)^{\beta - 1}},$$

$$C = \frac{(\alpha + \beta)\,\Gamma(\alpha + \beta)}{\alpha\beta\,\Gamma(\alpha)\Gamma(\beta)}.$$

Compare the efficiency of this procedure with the efficiency of Johnk's procedure,
Be-4.

17 Describe an algorithm for generating from $G(\alpha, 1)$ by the acceptance-rejection
method AR-1, assuming

$$h(x) = p\,Er(\beta, m) + (1 - p)\,Er(\beta, m + 1),$$

that is, h(x) is a mixture of two Erlang distributions, where $m = [\alpha]$ and $\beta = \alpha/m$.

18 Prove that Procedure Be-6 is more efficient than Procedure Be-5 for $\alpha > 2$,
$\beta > 2$.

19 Describe an acceptance-rejection algorithm for generating an r.v. from N(0, 1),
representing $f_X(x) = C g(x) h(x, \beta)$ and assuming that

$$h(x, \beta) = \frac{\beta}{\pi(\beta^2 + x^2)}, \qquad -\infty < x < \infty.$$

Verify that the optimal $\beta = 1$, the efficiency $1/C = e^{1/2}/\sqrt{2\pi} = 0.6578$, and
$g(x) = 0.8243(1 + x^2)e^{-x^2/2}$.

From Tadikamalla and Johnson [32].

20 Describe an algorithm for generating from truncated Erlang distribution 

/*(*)- iyr x ’ 

and find c. 

21 Prove that, if $f_X(x)$ can be represented as $f_X(x) = C h_1(x)(1 - H_2(T(x)))$, then
Algorithm AR-3 can be rewritten as AR-3'.

22 The p.m.f. for the uniform discrete distribution is

$$P_x = \frac{1}{b - a + 1}, \qquad x = a, a + 1, \ldots, b,$$

where b and a are integers and b > a. Prove that $X = [a + (b - a + 1)U]$ has the
desired distribution, and describe an algorithm for generating an r.v. from $P_x$. Here
[·] denotes the integer part.

23 Let Y be from the Bernoulli distribution, that is,

$$P_y = p^y(1 - p)^{1 - y}, \qquad y = 0, 1,\ 0 < p < 1.$$

Prove that, if $Y_1, \ldots, Y_n$ are i.i.d. r.v.'s from the Bernoulli distribution, then $X = \sum_{i=1}^{n} Y_i$
is from B(n, p). Describe an algorithm for generating an r.v. from B(n, p), using
the above result. For purposes of efficiency use the fact that if X is from B(n, p),
then n - X is from B(n, 1 - p).

24 Prove (3.7.25), that is, if the time intervals between events are from $\exp(1/\lambda)$,
then the number of events occurring in a unit interval of time is from $P(\lambda)$.





25 Prove that $y = x^{\delta_1}(1 - x)^{\delta_2}$ is a concave function on (0, 1) and has a maximum
equal to

$$\frac{\delta_1^{\delta_1}\delta_2^{\delta_2}}{(\delta_1 + \delta_2)^{\delta_1 + \delta_2}}$$

at the point $x = \dfrac{\delta_1}{\delta_1 + \delta_2}$.

26 Let X and $X_1$ be i.i.d. r.v.'s and let $Y = aX + (1 - a)X_1$, where 0 < |a| < 1.
Prove that the correlation coefficient

$$\rho_{XY} = \frac{a}{[a^2 + (1 - a)^2]^{1/2}}.$$

Describe an algorithm for generating a pair of r.v.'s (X, Y) for which $\rho_{XY} = \rho$.

27 Prove Theorems 3.4.2 and 3.4.3.

28 By analogy with Theorem 3.4.2 formulate a theorem that is a multidimensional
version of Algorithm AR-1, and prove it.

29 Let $X = (X_1, \ldots, X_n)$ be a random vector uniformly distributed inside an n-dimensional
unit sphere. Prove that the vector $Y = CW$ is uniformly distributed inside the
ellipsoid

$$Y^T \Sigma Y \le K^2,$$

where $\Sigma$ is a symmetric and positive definite $(n \times n)$ matrix and C is the lower
triangular matrix (3.5.13), such that $\Sigma = C^T C$. Hint: Use the fact that the vector
$W = (W_1, \ldots, W_n) = KX$ is uniformly distributed inside the n-dimensional sphere

$$W^T W = W_1^2 + W_2^2 + \cdots + W_n^2 \le K^2$$

with radius K.

CHAPTER 4

Monte Carlo Integration
and Variance
Reduction Techniques

4.1 INTRODUCTION 

The importance of good numerical integration schemes is evident. There 
are many deterministic quadrature formulas for computation of ordinary 
integrals with well behaved integrands. The Monte Carlo method is not 
competitive in this case. 

But if the function fails to be regular (i.e., to have continuous derivatives 
of moderate order), numerical analytic techniques, such as the trapezoidal 
and Simpson’s rules become less attractive. Especially in the case of 
multidimensional integrals, application of such rules (formulas) runs into 
severe difficulties. It is often more convenient to compute such integrals by 
a Monte Carlo method, which, although less accurate than conventional 
quadrature formulas, is much simpler to use. 

It is shown that each integral can be represented as an expected value 
(parameter) and the problem of estimating an integral by the Monte Carlo 
Method is equivalent to the problem of estimating an unknown parameter. 
For convenience we use the expression “estimating the integral” rather 
than “estimating the unknown parameter.” In Section 4.3.12 we consider 
several practical examples of estimating such parameters (integrals).


4.2 MONTE CARLO INTEGRATION

In this section we consider two simple techniques for computing one-
dimensional integrals,

$$I = \int_a^b g(x)\,dx, \tag{4.2.1}$$

by a Monte Carlo method. The first technique is called “the hit or miss
Monte Carlo method,” and is based on the geometrical interpretation of an
integral as an area; the second technique is called “the sample-mean
Monte Carlo method,” and is based on the representation of an integral as
a mean value.


4.2.1 The Hit or Miss Monte Carlo Method 

Consider the problem of calculating the one-dimensional integral (4.2.1)
where, for simplicity, we assume that the integrand g(x) is bounded:

$$0 \le g(x) \le c, \qquad a \le x \le b.$$

Let $\Omega$ denote the rectangle (Fig. 4.2.1)

$$\Omega = \{(x, y) : a \le x \le b,\ 0 \le y \le c\}.$$

Let (X, Y) be a random vector uniformly distributed over the rectangle $\Omega$
with probability density function (p.d.f.)

$$f_{XY}(x, y) = \begin{cases} \dfrac{1}{c(b - a)}, & \text{if } (x, y) \in \Omega, \\ 0, & \text{otherwise.} \end{cases} \tag{4.2.2}$$

What is the probability p that the random vector (X, Y) falls within the
area under the curve g(x)? Denoting $S = \{(x, y) : y \le g(x)\}$ and observing
that the area under the curve g(x) is

$$\text{area under } g(x) = \text{area } S = \int_a^b g(x)\,dx,$$

we obtain

$$p = \frac{\text{area } S}{\text{area } \Omega} = \frac{\int_a^b g(x)\,dx}{c(b - a)} = \frac{I}{c(b - a)}. \tag{4.2.3}$$

Fig. 4.2.1 Graphical representation of the hit
or miss Monte Carlo method.


Let us assume that N independent random vectors $(X_1, Y_1),
(X_2, Y_2), \ldots, (X_N, Y_N)$ are generated. The parameter p can be estimated by

$$\hat p = \frac{N_H}{N}, \tag{4.2.4}$$

where $N_H$ is the number of occasions on which $g(X_i) \ge Y_i$, $i = 1, 2, \ldots, N$,
that is, the number of “hits,” and $N - N_H$ is the number of “misses”; we
score a miss if $g(X_i) < Y_i$, $i = 1, \ldots, N$, as depicted in Fig. 4.2.1.

It follows from (4.2.3) and (4.2.4) that the integral I can be estimated by

$$I \approx \theta_1 = c(b - a)\frac{N_H}{N}. \tag{4.2.5}$$

In other words, to estimate the integral I we take a sample of size N from the
distribution (4.2.2), count the number $N_H$ of hits (below the curve g(x)),
and apply (4.2.5).

Since each of the N trials constitutes a Bernoulli trial with probability p
of a hit, then

$$E(\theta_1) = c(b - a)E\left(\frac{N_H}{N}\right) = c(b - a)\frac{E(N_H)}{N} = pc(b - a) = I,$$

that is, $\theta_1$ is an unbiased estimator of I.

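The variance of $\hat p$ is

$$\operatorname{var}\hat p = \operatorname{var}\left(\frac{N_H}{N}\right) = \frac{1}{N^2}\operatorname{var}(N_H) = \frac{1}{N}p(1 - p), \tag{4.2.6}$$

which, together with (4.2.3), gives

$$\operatorname{var}\hat p = \frac{1}{N}\,\frac{I[c(b - a) - I]}{[c(b - a)]^2}. \tag{4.2.7}$$

Thus

$$\operatorname{var}\theta_1 = \frac{I}{N}[c(b - a) - I], \tag{4.2.8}$$

since

$$\operatorname{var}\theta_1 = [c(b - a)]^2 \operatorname{var}\hat p = [c(b - a)]^2\,\frac{1}{N}p(1 - p), \tag{4.2.9}$$

and the standard deviation is $\sigma_{\theta_1} = N^{-1/2}\{I[c(b - a) - I]\}^{1/2}$.

Note that the standard deviation of the estimator $\theta_1$, whose inverse measures
the precision of the estimator, is of order $N^{-1/2}$.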
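How many trials do we have to perform, according to the hit or miss
Monte Carlo method, if we require

$$P[|\theta_1 - I| < \varepsilon] \ge \alpha? \tag{4.2.10}$$

Chebyshev's inequality,

$$P[|\theta_1 - I| < \varepsilon] \ge 1 - \frac{\operatorname{var}\theta_1}{\varepsilon^2}, \tag{4.2.11}$$

together with (4.2.10), gives

$$\alpha \le 1 - \frac{\operatorname{var}\theta_1}{\varepsilon^2}. \tag{4.2.12}$$

Substituting (4.2.9) in (4.2.12), we obtain

$$\alpha \le 1 - \frac{p(1 - p)[c(b - a)]^2}{N\varepsilon^2}. \tag{4.2.13}$$

Solving (4.2.13) with respect to N, we have

$$N \ge \frac{(1 - p)p[c(b - a)]^2}{(1 - \alpha)\varepsilon^2}, \tag{4.2.14}$$

which is the required number of trials for (4.2.10) to hold.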
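As a small numerical illustration of (4.2.14), the sketch below computes the smallest integer N that satisfies the bound. The function name and the example values (p = 1/3, c = 1, [a, b] = [0, 1], $\varepsilon$ = 0.01, $\alpha$ = 0.95) are assumptions chosen for illustration only.

```python
import math

def chebyshev_trials(p, c, a, b, eps, alpha):
    """Smallest integer N satisfying the Chebyshev bound (4.2.14)."""
    area = c * (b - a)                     # area of the enclosing rectangle
    return math.ceil(p * (1.0 - p) * area ** 2 / ((1.0 - alpha) * eps ** 2))

# Example: p = 1/3 corresponds to g(x) = x**2 on [0, 1] with c = 1.
n_required = chebyshev_trials(1.0 / 3.0, 1.0, 0.0, 1.0, 0.01, 0.95)
```

In practice p is unknown; since p(1 - p) never exceeds 1/4, replacing p(1 - p) by 1/4 gives a conservative value of N.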

When N is sufficiently large we can apply the central limit theorem,
which says that for N sufficiently large the random variable (r.v.)

$$\hat\theta = \frac{\theta_1 - I}{\sigma_{\theta_1}} \tag{4.2.15}$$

is distributed approximately according to the standard normal distribution,
that is,

$$P(\hat\theta \le x) \approx \Phi(x), \tag{4.2.16}$$

where

$$\Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-t^2/2}\,dt. \tag{4.2.17}$$

We can easily verify that the confidence interval with level $1 - 2\alpha$ for I is

$$\theta_1 \pm z_\alpha\,\frac{[\hat p(1 - \hat p)]^{1/2}(b - a)c}{N^{1/2}}, \tag{4.2.18}$$

where

$$z_\alpha = \Phi^{-1}(\alpha). \tag{4.2.19}$$

Hammersley and Handscomb [10] write:

Historically, hit or miss methods were once the ones most usually pro- 
pounded in explanation of Monte Carlo techniques; they were of course, the 
easiest methods to understand (particularly if explained in the kind of 
graphical language involving a curve in a rectangle). 

Hit or Miss Monte Carlo Method Algorithm

1 Generate a sequence $\{U_j\}_{j=1}^{2N}$ of 2N random numbers.

2 Arrange the random numbers into N pairs $(U_1, U_1'),
(U_2, U_2'), \ldots, (U_N, U_N')$ in any fashion such that each random number $U_j$ is
used exactly once.

3 Compute $X_i = a + U_i(b - a)$ and $g(X_i)$, $i = 1, 2, \ldots, N$.

4 Count the number of cases $N_H$ for which $g(X_i) > cU_i'$.

5 Estimate the integral I by $\theta_1 = c(b - a)N_H/N$.
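The hit or miss algorithm can be sketched in a few lines. The function name `hit_or_miss`, the callable argument `g`, and the use of Python's `random` module are illustrative assumptions; the final line applies the estimator (4.2.5), $c(b - a)N_H/N$.

```python
import random

def hit_or_miss(g, a, b, c, n, rng):
    """Estimate the integral of g over [a, b], assuming 0 <= g(x) <= c."""
    hits = 0
    for _ in range(n):
        x = a + rng.random() * (b - a)     # X_i = a + U_i (b - a)
        y = c * rng.random()               # c U_i'
        if g(x) > y:                       # a hit: the point lies under the curve
            hits += 1
    return c * (b - a) * hits / n          # estimator (4.2.5)
```

For example, `hit_or_miss(lambda x: x * x, 0.0, 1.0, 1.0, 20000, random.Random(7))` should return a value near 1/3.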


4.2.2 The Sample-Mean Monte Carlo Method

Another way of computing the integral

$$I = \int_a^b g(x)\,dx$$

is to represent it as an expected value of some random variable. Indeed, let
us rewrite the integral as

$$I = \int_a^b \frac{g(x)}{f_X(x)}\,f_X(x)\,dx, \tag{4.2.20}$$

assuming that $f_X(x)$ is any p.d.f. such that $f_X(x) > 0$ when $g(x) \ne 0$.
Then

$$I = E\left[\frac{g(X)}{f_X(X)}\right], \tag{4.2.21}$$

where the random variable X is distributed according to $f_X(x)$.
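Let us assume for simplicity

$$f_X(x) = \begin{cases} \dfrac{1}{b - a}, & \text{if } a < x < b, \\ 0, & \text{otherwise;} \end{cases} \tag{4.2.22}$$

then

$$E[g(X)] = \frac{I}{b - a} \tag{4.2.23}$$

and

$$I = (b - a)E[g(X)]. \tag{4.2.24}$$

An unbiased estimator of I is its sample mean

$$\theta_2 = (b - a)\frac{1}{N}\sum_{i=1}^{N} g(X_i). \tag{4.2.25}$$

The variance of $\theta_2$ is equal to $E(\theta_2^2) - [E(\theta_2)]^2$, so that

$$\operatorname{var}\theta_2 = \operatorname{var}\left[\frac{1}{N}(b - a)\sum_{i=1}^{N} g(X_i)\right] = \frac{1}{N}\left[(b - a)^2\int_a^b g^2(x)\frac{1}{b - a}\,dx - I^2\right] = \frac{1}{N}\left[(b - a)\int_a^b g^2(x)\,dx - I^2\right]. \tag{4.2.26}$$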

Sample-Mean Monte Carlo Algorithm

1 Generate a sequence $\{U_i\}_{i=1}^{N}$ of N random numbers.

2 Compute $X_i = a + U_i(b - a)$, $i = 1, \ldots, N$.

3 Compute $g(X_i)$, $i = 1, \ldots, N$.

4 Compute the sample mean $\theta_2$ according to (4.2.25), which
estimates I.
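A minimal sketch of the sample-mean algorithm follows; the function name `sample_mean` and the use of Python's `random` module are assumptions for illustration. Only N random numbers are needed, compared with 2N for the hit or miss method, and no upper bound c on g is required.

```python
import random

def sample_mean(g, a, b, n, rng):
    """Estimate the integral of g over [a, b] by (b - a) times the mean of g(X_i)."""
    total = 0.0
    for _ in range(n):
        x = a + rng.random() * (b - a)     # X_i = a + U_i (b - a)
        total += g(x)
    return (b - a) * total / n             # estimator (4.2.25)
```

For example, `sample_mean(lambda x: x * x, 0.0, 1.0, 20000, random.Random(8))` should return a value near 1/3.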


4.2.3 Efficiency of Monte Carlo Method

Suppose two Monte Carlo methods exist for estimating the integral I.
Let $\theta_1$ and $\theta_2$ be two estimators produced by these methods such that

$$E(\theta_1) = E(\theta_2) = I. \tag{4.2.27}$$

We denote by $t_1$ and $t_2$ the units of computing time required for evaluating
the random variables $\theta_1$ and $\theta_2$, respectively. Let the variance associated
with the first method be var $\theta_1$ and that associated with the second method
be var $\theta_2$. Then we say that the first method is more efficient than the
second method if

$$\frac{t_1 \operatorname{var}\theta_1}{t_2 \operatorname{var}\theta_2} < 1. \tag{4.2.28}$$


Let us compare now the efficiency of the hit or miss Monte Carlo
method with that of the sample-mean Monte Carlo method.

Proposition 4.2.1 $\operatorname{var}\theta_2 \le \operatorname{var}\theta_1$.


Proof Subtracting (4.2.26) from (4.2.9), we obtain

$$\operatorname{var}\theta_1 - \operatorname{var}\theta_2 = \frac{b - a}{N}\left[cI - \int_a^b g^2(x)\,dx\right]. \tag{4.2.29}$$

Note that

$$g(x) \le c, \tag{4.2.30}$$

so that $g^2(x) \le c\,g(x)$; therefore

$$cI - \int_a^b g^2(x)\,dx \ge 0$$

and further

$$\operatorname{var}\theta_1 - \operatorname{var}\theta_2 \ge 0.$$

Q.E.D.

Assuming that the computing times $t_1$ and $t_2$ for $\theta_1$ and $\theta_2$ are approximately
equal, we conclude that the sample-mean method is more efficient
than the hit or miss method.
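As a worked illustration of Proposition 4.2.1 (the integrand is an assumption chosen for convenience, not an example from the text), take $g(x) = x^2$ on [0, 1] with bound c = 1, so that I = 1/3. The variance of the hit or miss estimator, $I[c(b - a) - I]/N$, and formula (4.2.26) then give:

```latex
N \operatorname{var}\theta_1 = I\,[c(b - a) - I] = \tfrac13 \cdot \tfrac23 = \tfrac29 \approx 0.222,
\qquad
N \operatorname{var}\theta_2 = (b - a)\int_0^1 x^4\,dx - I^2 = \tfrac15 - \tfrac19 = \tfrac{4}{45} \approx 0.089 .
```

For this integrand the hit or miss method would need about 2.5 times as many trials as the sample-mean method to reach the same variance.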

If var0, and var0 2 are unknown, we can replace them by their estima- 
tors 

, r n i 2 

2 *(*,)<*-«)-* 

(4.2.31) 

and then estimate by 


' ,sf 

i 2 S? 

(4.2.32) 

It is interesting to note that, estimating the integral by $\theta_1$ and $\theta_2$, we do not
need to know the function g(x) explicitly. We need only evaluate g(x) at
any point x.

4.2.4 Integration in the Presence of Noise 


Suppose now that g(x) is measured with some error, that is, we observe
$\tilde g(x_i) = g(x_i) + \varepsilon_i$, $i = 1, 2, \ldots, N$, instead of $g(x_i)$, where the $\varepsilon_i$ are independent
identically distributed (i.i.d.) random variables with

$$E(\varepsilon) = 0, \qquad \operatorname{var}(\varepsilon) = \sigma^2 \tag{4.2.33}$$

and

$$|\varepsilon| < k < \infty. \tag{4.2.34}$$

Let (X, Y) be a random vector distributed

$$f_{XY}(x, y) = \begin{cases} \dfrac{1}{c_1(b - a)}, & a \le x \le b,\ 0 \le y \le c_1, \\ 0, & \text{otherwise,} \end{cases}$$

where

$$c_1 \ge g(x) + k.$$

Then, by analogy with $\theta_1$ for the hit or miss method, we obtain

$$\tilde\theta_1 = c_1(b - a)\frac{N_H}{N}, \tag{4.2.35}$$

where $N_H$ is the number of hits, that is, $\tilde g(X_i) \ge Y_i$, $i = 1, \ldots, N$. By analogy
with $\theta_2$ for the sample-mean Monte Carlo method with

$$f_X(x) = \begin{cases} \dfrac{1}{b - a}, & a \le x \le b, \\ 0, & \text{otherwise,} \end{cases}$$

we obtain

$$\tilde\theta_2 = \frac{1}{N}(b - a)\sum_{i=1}^{N} \tilde g(X_i). \tag{4.2.36}$$

We can show that both r.v.'s $\tilde\theta_1$ and $\tilde\theta_2$ are unbiased and converge almost
surely (a.s.) and in mean square to I and that the sample-mean method is
again more efficient than the hit or miss method.


4.3 VARIANCE REDUCTION TECHNIQUES

Variance reduction can be viewed as a means to use known information 
about the problem. In fact, if nothing is known about the problem, 
variance reduction cannot be achieved. At the other extreme, that is, 
complete knowledge, the variance is equal to zero and there is no need for 
simulation. Variance reduction cannot be obtained from nothing; it is 
merely a way of not wasting information. One way to gain this information 
is through a direct crude simulation of the process. Results from this 
simulation can then be used to define variance reduction techniques that 
will refine and improve the efficiency of a second simulation. Therefore 
the more that is known about the problem, the more effective the variance 
reduction techniques that can be employed. Hence it is always important 
to clearly define what is known about the problem. Knowledge of a
process to be simulated can be qualitative, quantitative, or both.


4.3.1 Importance Sampling

Let us consider the problem of estimating the multiple integral*

$$I = \int g(x)\,dx, \qquad x \in D \subset R^n. \tag{4.3.1}$$

We suppose that $g \in L^2(D)$ (in other words, that $\int g^2(x)\,dx$ exists and
therefore that I exists).

The basic idea of this technique [14] consists of concentrating the
distribution of the sample points in the parts of the region D that are of
most “importance” instead of spreading them out evenly. By analogy with
(4.2.20) and (4.2.21) we can represent the integral (4.3.1) as

$$I = \int \frac{g(x)}{f_X(x)}\,f_X(x)\,dx = E\left[\frac{g(X)}{f_X(X)}\right]. \tag{4.3.2}$$

Here X is any random vector with p.d.f. $f_X(x)$, such that $f_X(x) > 0$ for each
$x \in D \subset R^n$. The function $f_X(x)$ is called the importance sampling
distribution. It is obvious from (4.3.2) that $\zeta = g(X)/f_X(X)$ is an unbiased
estimator of I, with the variance

$$\operatorname{var}\zeta = \int \frac{g^2(x)}{f_X(x)}\,dx - I^2. \tag{4.3.3}$$

In order to estimate the integral we take a sample $X_1, \ldots, X_N$ from p.d.f.
$f_X(x)$ and substitute its values in the sample-mean formula

$$\theta_3 = \frac{1}{N}\sum_{i=1}^{N}\frac{g(X_i)}{f_X(X_i)}. \tag{4.3.4}$$

We now show how to choose the distribution of the r.v. X in order to
minimize the variance of $\zeta$, which is the same as to minimize the variance
of $\theta_3$.


Theorem 4.3.3 The minimum of var $\zeta$ is equal to

$$\operatorname{var}\zeta_0 = \left(\int |g(x)|\,dx\right)^2 - I^2 \tag{4.3.5}$$

and occurs when the r.v. X is distributed with p.d.f.

$$f_X(x) = \frac{|g(x)|}{\int |g(x)|\,dx}. \tag{4.3.6}$$

*Formula (4.3.1) is a Lebesgue integral and it is assumed that the domain of integration is
bounded (has finite measure). Readers not familiar with Lebesgue integrals may assume it to
be a Riemann integral.

Proof Formula (4.3.5) follows directly if we substitute (4.3.6) into (4.3.3).
In order to prove that var $\zeta_0 \le$ var $\zeta$ it is enough to prove that

$$\left(\int |g(x)|\,dx\right)^2 \le \int \frac{g^2(x)}{f_X(x)}\,dx, \tag{4.3.7}$$

which can be obtained from the Cauchy-Schwarz inequality.

Indeed,

$$\left(\int |g(x)|\,dx\right)^2 = \left(\int \frac{|g(x)|}{[f_X(x)]^{1/2}}\,[f_X(x)]^{1/2}\,dx\right)^2 \le \int \frac{g^2(x)}{f_X(x)}\,dx \int f_X(x)\,dx = \int \frac{g^2(x)}{f_X(x)}\,dx. \tag{4.3.8}$$

Q.E.D.

Corollary If g(x) > 0, then the optimal p.d.f. is

$$f_X(x) = \frac{g(x)}{I} \tag{4.3.9}$$

and var $\zeta$ = 0.

This method is unfortunately useless, since the optimal density contains
the integral $\int |g(x)|\,dx$, which is practically equivalent to computing I. In
the case where g(x) has constant sign it is precisely equivalent to calculating
I. But if we already know I, we do not need Monte Carlo methods to
estimate it.

Not all is lost, however. The variance can be essentially reduced if $f_X(x)$
is chosen in order to have a shape similar to that of |g(x)|. When choosing
$f_X(x)$ in such a way we have to take into consideration the difficulties of
sampling from such a p.d.f., especially if |g(x)| is not a well behaved
function. In estimating the integral, we can save CPU time if the sample
$X_1, \ldots, X_N$ is taken in the subregion $D' = \{x : g(x) \ne 0\}$ of D. This is
the same as defining

$$f_X(x) > 0, \text{ if } g(x) \ne 0 \quad\text{and}\quad f_X(x) = 0, \text{ if } g(x) = 0. \tag{4.3.10}$$

Consider the problem of choosing the parameters of the distribution
$f_X(x)$ in an optimal way. We assume that the p.d.f. $f_X(x)$ is determined up
to the vector of parameters $\alpha$, that is, $f_X(x) = f_X(x, \alpha)$. For instance, if
$f_X(x)$ represents a one-dimensional normal distribution, that is, $N(\mu, \sigma^2)$,
then the unknown parameters can be the expected value $\mu$ and the
variance $\sigma^2$. We want to choose the vector of parameters $\alpha$ to minimize the
variance of $\theta_3$, that is,

$$\min_\alpha \operatorname{var}\theta_3 = \min_\alpha \frac{1}{N}\left[\int \frac{g^2(x)}{f_X(x, \alpha)}\,dx - I^2\right]. \tag{4.3.11}$$

The last problem is equivalent to

$$\min_\alpha \int \frac{g^2(x)}{f_X(x, \alpha)}\,dx. \tag{4.3.12}$$

The function

$$\int \frac{g^2(x)}{f_X(x, \alpha)}\,dx \tag{4.3.13}$$

can be multiextremal and generally it is difficult to find the optimal $\alpha$.
Some techniques for global optimization are discussed in Chapter 7.
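To make the idea of matching the shape of $f_X(x)$ to |g(x)| concrete, the sketch below estimates $\int_0^1 e^x\,dx = e - 1$ in two ways: with the uniform density (crude sample mean) and with the density $f_X(x) = (2/3)(1 + x)$, which roughly follows the growth of $e^x$. The integrand, the density, and all function names are assumptions chosen for illustration; sampling from $f_X$ uses its inverse c.d.f., $x = -1 + \sqrt{1 + 3u}$.

```python
import math
import random

def mean_and_var(values):
    """Sample mean and sample variance of a list of numbers."""
    n = len(values)
    m = sum(values) / n
    v = sum((x - m) ** 2 for x in values) / (n - 1)
    return m, v

def crude(n, rng):
    """Uniform density on (0, 1): the terms are g(U) = exp(U)."""
    return mean_and_var([math.exp(rng.random()) for _ in range(n)])

def importance(n, rng):
    """Density f(x) = (2/3)(1 + x) on (0, 1): the terms are g(X) / f(X)."""
    terms = []
    for _ in range(n):
        u = rng.random()
        x = -1.0 + math.sqrt(1.0 + 3.0 * u)      # inverse c.d.f. of f
        terms.append(math.exp(x) / ((2.0 / 3.0) * (1.0 + x)))
    return mean_and_var(terms)
```

Each function returns the estimate (4.3.4) together with the sample variance of the individual terms; the second variance should be several times smaller than the first.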


4.3.2 Correlated Sampling 

Correlated sampling is one of the most powerful variance reduction 
techniques. 

Frequently, the primary objective of a simulation study is to determine 
the effect of a small change in the system. The sample-mean Monte Carlo 
method would make two independent runs, with and without the change in 
the system being simulated, and subtract the results obtained. Unfor- 
tunately, the difference being calculated is often small compared to the 
separate results, while the variance of the difference will be the sum of the 
variances in the two runs, which is usually significant. If, instead of being 
independent, the two simulations use the same random numbers, the 
results can be highly positively correlated, which provides a reduction in 
the variance. Another way of viewing correlated sampling through random 
numbers control is to realize that the use of the same random numbers 
generates identical histories in those parts of the two systems that are the 
same. Thus the aim of correlated sampling is to produce a high positive 
correlation between two similar processes so that the variance of the 
difference is considerably smaller than it would be if the two processes 
were statistically independent.

Unfortunately, there is no general procedure that can be implemented in
correlated sampling. However, in the following two situations correlated
sampling can be successfully employed.

1 The value of a small change in a system is to be calculated. 

2 The difference in a parameter in two or more similar cases is of more 
interest than its absolute value. 

Let us assume that we desire to estimate

$$\Delta I = I_1 - I_2, \tag{4.3.14}$$

where*

$$I_1 = \int g_1(x) f_1(x)\,dx, \qquad x \in D_1 \subset R^n, \tag{4.3.15}$$

and

$$I_2 = \int g_2(x) f_2(x)\,dx, \qquad x \in D_2 \subset R^n. \tag{4.3.16}$$

Then the procedure for correlated sampling is as follows:

1 Generate $X_1, \ldots, X_N$ from $f_1(x)$ and $Y_1, \ldots, Y_N$ from $f_2(x)$.

2 Estimate $\Delta I$ using

$$\Delta\theta = \theta_1 - \theta_2 = \frac{1}{N}\sum_{i=1}^{N} g_1(X_i) - \frac{1}{N}\sum_{i=1}^{N} g_2(Y_i). \tag{4.3.17}$$

The variance of $\Delta\theta$ is

$$\sigma^2 = \sigma_1^2 + \sigma_2^2 - 2\operatorname{cov}(\theta_1, \theta_2), \tag{4.3.18}$$

where

$$\theta_1 = \frac{1}{N}\sum_{i=1}^{N} g_1(X_i), \tag{4.3.19}$$

$$\theta_2 = \frac{1}{N}\sum_{i=1}^{N} g_2(Y_i), \tag{4.3.20}$$

$$\sigma_1^2 = E(\theta_1 - I_1)^2, \tag{4.3.21}$$

$$\sigma_2^2 = E(\theta_2 - I_2)^2, \tag{4.3.22}$$

and

$$\operatorname{cov}(\theta_1, \theta_2) = E[(\theta_1 - I_1)(\theta_2 - I_2)]. \tag{4.3.23}$$

*Introducing $g(x) = \varphi(x)/f_X(x)$, where $f_X(x)$ is a p.d.f., the integral $I = \int \varphi(x)\,dx$ can be
written as $I = \int g(x) f_X(x)\,dx$. An unbiased estimator of the last integral is

$$\eta = g(X) \tag{4.3.13}$$

and the integral can be estimated by

$$\theta = \frac{1}{N}\sum_{i=1}^{N} g(X_i). \tag{4.3.14}$$

Now if $\theta_1$ and $\theta_2$ are statistically independent, then

$$\operatorname{cov}(\theta_1, \theta_2) = 0 \tag{4.3.24}$$

and

$$\sigma^2 = \sigma_1^2 + \sigma_2^2. \tag{4.3.25}$$

However, if the random variables X and Y are positively correlated and if
$g_1(x)$ is similar to $g_2(x)$ in shape, then the random variables $\theta_1$ and $\theta_2$ will
also be positively correlated, that is, $\operatorname{cov}(\theta_1, \theta_2) > 0$, and the variance of $\Delta\theta$
may be greatly reduced.

Thus the key to reducing the variance of A0 is to insure positive 
correlation between the estimates I x and / 2 - This can be achieved in several 
ways. The easiest way is to obtain correlated samples through random 
number control. Specifically, this can be accomplished by using the same 
(common) sequence of random numbers U V ..*,U N in both simulations, 
that is, the sequences X t> ,.. y X N and Y l 9 ,..,Y N are generated using 
X t =* F{~ *((/,) and Y i * F 2 ~ \U i ) f respectively. Clearly, if f x is similar to / r , 
the r.v.’s X i and Y i will be highly positively correlated since they both used 
the same random numbers. 
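
As an illustration of random number control, the following minimal sketch compares independent sampling with common random numbers when estimating $\Delta I = E[X] - E[Y]$. The exponential distributions, the rates 1.0 and 1.1, the function names, and the replication counts are all assumptions made for this example; they are not taken from the text.

```python
import math
import random
import statistics

def inv_exp_cdf(u, rate):
    """Inverse c.d.f. of an exponential distribution with the given rate."""
    return -math.log(1.0 - u) / rate

def difference_estimate(n, rate1, rate2, rng, common):
    """One estimate of E[X] - E[Y] based on n pairs (X_i, Y_i)."""
    total = 0.0
    for _ in range(n):
        u1 = rng.random()
        # Common random numbers reuse u1; otherwise draw a fresh number.
        u2 = u1 if common else rng.random()
        total += inv_exp_cdf(u1, rate1) - inv_exp_cdf(u2, rate2)
    return total / n

rng = random.Random(12345)
reps, n = 200, 100
crn = [difference_estimate(n, 1.0, 1.1, rng, True) for _ in range(reps)]
ind = [difference_estimate(n, 1.0, 1.1, rng, False) for _ in range(reps)]
var_crn = statistics.variance(crn)   # spread of the estimate with common numbers
var_ind = statistics.variance(ind)   # spread of the estimate with independent numbers
print(var_crn, var_ind)
```

Because both inverse transforms are increasing in $U_i$, the two samples are positively correlated and the covariance term in (4.3.18) works in our favor.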

It is difficult to be specific as to how random number control should be 
applied generally. As a rule, however, to achieve maximum correlation 
common random numbers should be used whenever the similarities in 
problem structure will permit this. Such an example is given in Section 
6.7.2, while comparing some output parameters of regenerative processes. 

4.3.3 Control Variates 

The use of control variates is another technique for reducing the vari- 
ance. In this technique, instead of estimating a parameter directly, the 
difference between the problem of interest and some analytical model is 
considered. 

Control variates have been applied very generally [10, 12, 13]; most 
applications concern queues and queueing networks (see Sections 4.3.13 and 6.7). Our 
nomenclature follows Lavenberg and Welch's paper [13]. 

A random variate $C$ is a control variate for $Y$ if it is correlated with $Y$ 
and if its expectation $\mu_C$ is known. The control variate $C$ is used to 
construct an estimator for $\mu = E[Y]$ that has a smaller variance than the estimator 
$Y$. For any $\beta$ 

$$Y(\beta) = Y - \beta(C - \mu_C) \tag{4.3.26}$$

is an unbiased estimator of $\mu$. Now 

$$\operatorname{var}[Y(\beta)] = \operatorname{var}[Y] - 2\beta\operatorname{cov}[Y, C] + \beta^2\operatorname{var}[C]. \tag{4.3.27}$$

Hence if 

$$2\beta\operatorname{cov}[Y, C] > \beta^2\operatorname{var}[C],$$

variance reduction is achieved. The value of $\beta$ that minimizes $\operatorname{var}[Y(\beta)]$ is 
easily found to be 

$$\beta^* = \frac{\operatorname{cov}[Y, C]}{\operatorname{var}[C]}$$

and the minimum variance is equal to 

$$\operatorname{var}[Y(\beta^*)] = (1 - \rho_{YC}^2)\operatorname{var}[Y], \tag{4.3.28}$$

where $\rho_{YC}$ is the correlation coefficient between $Y$ and $C$. Hence the more 
$C$ is correlated with $Y$, the greater the reduction in variance. 
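
A small numerical sketch of (4.3.26) to (4.3.28) may help. It assumes $Y = e^U$ with $U$ uniform on $(0,1)$ and takes $C = U$ with known mean $\mu_C = 0.5$ as the control variate; this test problem, the sample size, and the use of a sample estimate in place of the true $\beta^*$ are choices made for this illustration only.

```python
import math
import random
import statistics

rng = random.Random(2024)
n = 10_000
u = [rng.random() for _ in range(n)]
y = [math.exp(ui) for ui in u]   # Y = exp(U); E[Y] = e - 1
c = u                            # control variate C = U
mu_c = 0.5                       # known expectation of C

y_bar = statistics.fmean(y)
c_bar = statistics.fmean(c)
cov_yc = sum((yi - y_bar) * (ci - c_bar) for yi, ci in zip(y, c)) / (n - 1)
beta = cov_yc / statistics.variance(c)   # sample analogue of cov[Y,C]/var[C]

# Controlled observations Y(beta) = Y - beta (C - mu_C), as in (4.3.26).
y_cv = [yi - beta * (ci - mu_c) for yi, ci in zip(y, c)]
cv_estimate = statistics.fmean(y_cv)
var_plain = statistics.variance(y)
var_cv = statistics.variance(y_cv)
print(y_bar, cv_estimate, var_plain, var_cv)
```

Here $Y$ and $C$ are strongly correlated, so by (4.3.28) most of the variance of $Y$ is removed.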

Another type of control variate is one for which the mean $E(C)$ is 
unknown but is equal to $\mu$, that is, $E(C) = E(Y) = \mu$. Any linear combination 

$$Y(\beta) = \beta Y + (1 - \beta)C$$

is again an unbiased estimator of $\mu$, and if $Y$ and $C$ are correlated, 
variance reduction will be achieved. 

We now extend the above results to the case of more than one control 
variate. Let $C = (C_1, \ldots, C_Q)$ be a vector of $Q$ control variates, let $\mu_C$ be 
the known mean vector corresponding to $C$, that is, $\mu_C = (\mu_1, \ldots, \mu_Q)$, 
where $\mu_q = E[C_q]$, and let $\beta$ be any vector. Then 

$$Y(\beta) = Y - \beta^t(C - \mu_C) \tag{4.3.29}$$

is an unbiased estimator of $\mu$. Here $t$ is the transpose operator. The vector 
$\beta^*$ that minimizes $\operatorname{var}[Y(\beta)]$ (see [13]) is 

$$\beta^* = \Sigma_C^{-1}\sigma_{YC}, \tag{4.3.30}$$

where $\Sigma_C$ is the covariance matrix of $C$ and $\sigma_{YC}$ is a $Q$-dimensional vector 
whose components are the covariances between $Y$ and the $C_q$'s. The resulting 
minimum variance is 

$$\operatorname{var}[Y(\beta^*)] = (1 - R_{YC}^2)\operatorname{var}[Y], \tag{4.3.31}$$

where 

$$R_{YC}^2 = \frac{\sigma_{YC}^t\,\Sigma_C^{-1}\,\sigma_{YC}}{\operatorname{var}[Y]}. \tag{4.3.32}$$

As before, the larger the multiple correlation coefficient $R_{YC}^2$ between $C$ 
and $Y$, the greater the variance reduction. 

Again, if $Y_1, \ldots, Y_{Q+1}$ are $Q + 1$ different unbiased estimators of the unknown 
$\mu$, then 

$$\sum_{i=1}^{Q+1}\beta_i Y_i, \tag{4.3.33}$$

where $\sum_{i=1}^{Q+1}\beta_i = 1$, is also an unbiased estimator of $\mu$. 

For practical application of control variables there are two key problems. 
First, control variables must be found that are highly correlated with 
the estimators of interest. Second, since the vector $\sigma_{YC}$ and the matrix $\Sigma_C$ 
are in general unknown, the optimum coefficient vector $\beta^*$ is unknown 
and must be estimated. Further, its estimation must be incorporated into 
effective statistical procedures, and we now turn our attention to these 
questions. 

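Let $Y_k$, $k = 1, \ldots, K$, be a sample from $f_Y(y)$. An unbiased estimator of 
$\mu$ is 

$$\bar Y = \frac{1}{K}\sum_{k=1}^{K} Y_k.$$

The variance of $\bar Y$ is equal to 

$$\sigma^2(\bar Y) = \frac{\operatorname{var}[Y]}{K}$$

and is estimated by 

$$\hat\sigma^2(\bar Y) = \frac{1}{K(K-1)}\sum_{k=1}^{K}(Y_k - \bar Y)^2. \tag{4.3.34}$$

The random variable 

$$\frac{\bar Y - \mu}{\hat\sigma(\bar Y)}$$

has approximately a $t$-distribution with $K - 1$ degrees of freedom. The 
confidence interval can be found from 

$$\operatorname{prob}\left\{\bar Y - t_{K-1}\!\left(1 - \tfrac{\alpha}{2}\right)\hat\sigma(\bar Y) \le \mu \le \bar Y + t_{K-1}\!\left(1 - \tfrac{\alpha}{2}\right)\hat\sigma(\bar Y)\right\} \approx 1 - \alpha. \tag{4.3.35}$$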

Let $C_k$ be the value of $C$ for the $k$th run. Then if the optimum 
coefficient vector $\beta^*$ were known, we would use the estimator 

$$Y_k(\beta^*) = Y_k - \beta^{*t}(C_k - \mu_C) \tag{4.3.36}$$

for the $k$th replication. The estimator based on $K$ runs would be 

$$\bar Y(\beta^*) = \frac{1}{K}\sum_{k=1}^{K} Y_k(\beta^*)$$

and a confidence interval could be obtained by replacing $\bar Y$ and $Y_k$ with 
$\bar Y(\beta^*)$ and $Y_k(\beta^*)$, respectively, in (4.3.34) and (4.3.35). In this case ($\beta^*$ 
known) 

$$\frac{\sigma^2(\bar Y(\beta^*))}{\sigma^2(\bar Y)} = 1 - R_{YC}^2, \tag{4.3.37}$$

and the variance reduction given by (4.3.31) would be obtained. Furthermore, 
the ratio of the mean confidence interval widths would be approximately 
proportional to the ratio of the standard deviations, and hence 
confidence interval width would be reduced by approximately $(1 - R_{YC}^2)^{1/2}$. 

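However, in practice $\beta^*$ is unknown and hence must be estimated. We 
estimate it by the sample equivalent of (4.3.30), that is, by 

$$\hat\beta^* = \hat\Sigma_C^{-1}\hat\sigma_{YC}, \tag{4.3.38}$$

where $\hat\sigma_{YC}$ and $\hat\Sigma_C$ are the sample covariance vector and sample covariance 
matrix whose elements are given by 

$$(\hat\sigma_{YC})_q = (K - 1)^{-1}\sum_{k=1}^{K}(Y_k - \bar Y)(C_{qk} - \bar C_q)$$

and 

$$(\hat\Sigma_C)_{qr} = (K - 1)^{-1}\sum_{k=1}^{K}(C_{qk} - \bar C_q)(C_{rk} - \bar C_r),$$

where $C_{qk}$ is the $q$th element of $C_k$ and $\bar C_q$ is the average of $C_{qk}$, 
$k = 1, \ldots, K$. Substituting $\hat\beta^*$ for $\beta^*$ in (4.3.36), we obtain 

$$Y_k(\hat\beta^*) = Y_k - \hat\beta^{*t}(C_k - \mu_C)$$

and 

$$\bar Y(\hat\beta^*) = \frac{1}{K}\sum_{k=1}^{K} Y_k(\hat\beta^*).$$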

In general, $\bar Y(\hat\beta^*)$ is a biased estimator of $\mu$ since $\hat\beta^*$ and $\bar C$ are 
dependent. Also, the $Y_k(\hat\beta^*)$ are dependent, so we cannot directly use the 
$t$-statistic to obtain a confidence interval for $\mu$. However, if we assume 
$Z = \begin{pmatrix} Y \\ C \end{pmatrix}$ to have a multivariate normal distribution, then it is shown in 
[13] that $\bar Y(\hat\beta^*)$ is an unbiased estimator of $\mu$ and 

$$\frac{\bar Y(\hat\beta^*) - \mu}{\hat\sigma(\bar Y(\hat\beta^*))} \tag{4.3.39}$$

has a $t$-distribution with $K - Q - 1$ degrees of freedom. Hence a confidence 
interval can be obtained from 

$$\operatorname{prob}\left\{\bar Y(\hat\beta^*) - t_{K-Q-1}\!\left(1 - \tfrac{\alpha}{2}\right)\hat\sigma(\bar Y(\hat\beta^*)) \le \mu \le \bar Y(\hat\beta^*) + t_{K-Q-1}\!\left(1 - \tfrac{\alpha}{2}\right)\hat\sigma(\bar Y(\hat\beta^*))\right\} = 1 - \alpha. \tag{4.3.40}$$


Further, the ratio $\sigma^2(\bar Y(\hat\beta^*))/\sigma^2(\bar Y)$ is given [13] by 

$$\frac{\sigma^2(\bar Y(\hat\beta^*))}{\sigma^2(\bar Y)} = \left(\frac{K - 2}{K - Q - 2}\right)(1 - R_{YC}^2). \tag{4.3.41}$$

We can see from (4.3.41) that there exists a trade-off between 
$(K - 2)/(K - Q - 2)$ and $1 - R_{YC}^2$. At one extreme, if $K$ is not large with respect 
to $Q$, the factor $(K - 2)/(K - Q - 2)$ can nullify the potential variance 
reduction. At the other extreme we expect the factor $1 - R_{YC}^2$ to be a 
decreasing function with respect to $Q$. It was indicated in [13] that for 
finite $K$ the number of control variates $Q$ has to be relatively small. It 
would be interesting to find the optimal $Q$ as a function of $K$ by making 
some assumptions about $R_{YC}$. 
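
To make the trade-off in (4.3.41) concrete, the following sketch evaluates the variance ratio for a few values of $Q$. The function name `variance_ratio` and the numbers $K = 20$ and $R_{YC}^2 = 0.5$ are hypothetical, chosen only to show how the factor $(K-2)/(K-Q-2)$ can outweigh the reduction $1 - R_{YC}^2$; in practice $R_{YC}^2$ would itself grow with $Q$.

```python
def variance_ratio(k, q, r2):
    """Right-hand side of (4.3.41): ((K - 2) / (K - Q - 2)) * (1 - R^2)."""
    if k - q - 2 <= 0:
        raise ValueError("need K > Q + 2")
    return (k - 2) / (k - q - 2) * (1.0 - r2)

# Hold R^2 fixed to isolate the loss caused by estimating beta*.
for q in (1, 2, 5, 10):
    print(q, variance_ratio(20, q, 0.5))
```

With these illustrative numbers a ratio above 1 at $Q = 10$ means that the controlled estimator is worse than the plain sample mean.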

The major cost involved in the application of control variables is the 
effort required to develop a reasonable set of control variates. This requires 
understanding the model in sufficient detail to define possible control 
variables and estimators of interest. 

There are only a few published reports describing the application of 
control variables to practical problems. However, judging from them we 
hope that variance reduction in the range 0.25 to 0.75 could be realized in 
practical situations. 

Now we consider how the control variates can be used in estimating the 
integral 

$$I = E[g(X)] = \int g(x) f_X(x)\,dx. \tag{4.3.42}$$

Let $g_0(x)$ be a function that approximates $g(x)$ well and let the expectation 
$E[g_0(X)]$ be known. The function $g_0(x)$ is a control variate for $g(x)$. 
Denoting $Y = g(X)$, $C = g_0(X)$, and $\mu_C = \int g_0(x) f_X(x)\,dx$, we have for 
any $\beta$ 

$$Y(\beta) = Y - \beta(C - \mu_C),$$

which is an unbiased estimator of the integral $I$. 

Taking a sample $X_1, \ldots, X_N$ from $f_X(x)$, we can estimate the integral $I$ 
by 

$$\theta_5 = \frac{1}{N}\sum_{i=1}^{N}\left[g(X_i) - \beta^*\left(g_0(X_i) - \mu_C\right)\right],$$

where $\beta^*$ is the optimal $\beta$, which minimizes $\operatorname{var}[Y(\beta)]$. The efficiency of 
this technique depends on how well $g_0(x)$ approximates $g(x)$. But it is 
sometimes difficult to find a $g_0(x)$ that approximates $g(x)$ well enough and 
such that $E[g_0(X)]$ is known. 

In many cases no approximation is known for $g(x)$. This can be 
overcome by simulating some values of $X$ (making a pilot run) and plotting 
the results. 

The extension to the case of $Q$ control variates (see (4.3.29)) in calculating 
the integral $I$ is as follows. Let $\phi(X) = [\phi_1(X), \ldots, \phi_Q(X)]$ be a vector 
of control variates, with known mean vector $\mu_\phi$, that is, $\mu_{\phi_q} = E[\phi_q(X)]$. 
Then for any vector $\beta$ 

$$Y(\beta) = g(X) - \beta^t(\phi(X) - \mu_\phi) \tag{4.3.43}$$

is an unbiased estimator of $\mu$. Denoting $Y = g(X)$, $\phi(X) = C$, $\mu_\phi = \mu_C$, we 
obtain formula (4.3.29). 


4.3.4 Stratified Sampling 

This technique is well known in statistics [3]. For stratified sampling we 
break the region $D$ into $m$ disjoint subregions $D_i$, $i = 1, 2, \ldots, m$, that is, 
$D = \bigcup_{i=1}^{m} D_i$, $D_k \cap D_j = \emptyset$, $k \ne j$, where $\emptyset$ is the empty set. Then define 

$$I_i = \int_{D_i} g(x) f_X(x)\,dx, \tag{4.3.44}$$

which can be estimated separately by the Monte Carlo method (for 
instance by the sample-mean Monte Carlo). 

The idea of this technique is similar to the idea of importance sampling: 
we also take more observations (samples) in the parts of the region $D$ that 
are more "important," but the effect of reducing the variance is achieved 
by concentrating more samples in more important subsets $D_i$, rather than 
by choosing the optimal p.d.f. 

Let us define 

$$P_i = \int_{D_i} f_X(x)\,dx. \tag{4.3.45}$$

It is obvious that $\sum_{i=1}^{m} P_i = 1$ and 

$$I = \int_D g(x) f_X(x)\,dx = \sum_{i=1}^{m}\int_{D_i} g(x) f_X(x)\,dx = \sum_{i=1}^{m} I_i. \tag{4.3.46}$$

Introducing 

$$g_i(x) = \begin{cases} g(x), & \text{if } x \in D_i \\ 0, & \text{otherwise,} \end{cases} \tag{4.3.47}$$

we can rewrite integral $I_i$ as 

$$I_i = \int_{D_i} g(x) f_X(x)\,dx = P_i\int g_i(x)\frac{f_X(x)}{P_i}\,dx = P_i E[g_i(X)],$$

where 

$$\int_{D_i}\frac{f_X(x)}{P_i}\,dx = 1.$$

Inasmuch as $I_i$ is expressed as an expected value, the sample-mean 
estimator for $I_i$ can be written as 

$$\eta_i = P_i\,g(X_i), \tag{4.3.49}$$

where the r.v. $X_i$ is distributed according to $f_X(x)/P_i$ on $D_i$. 
The integral $I_i$ can be estimated by 

$$\theta_i = \frac{P_i}{N_i}\sum_{k_i=1}^{N_i} g(X_{k_i}), \qquad X_{k_i} \in D_i,\ i = 1, \ldots, m \tag{4.3.50}$$

and the integral $I$ by 

$$\theta_6 = \sum_{i=1}^{m}\frac{P_i}{N_i}\sum_{k_i=1}^{N_i} g(X_{k_i}). \tag{4.3.51}$$

We may quickly verify that 

$$\operatorname{var}\theta_6 = \sum_{i=1}^{m}\frac{P_i^2}{N_i}\operatorname{var} g(X_i) = \sum_{i=1}^{m}\frac{P_i^2\sigma_i^2}{N_i}, \tag{4.3.52}$$

where 

$$\sigma_i^2 = \operatorname{var} g(X_i) = \frac{1}{P_i}\int_{D_i} g^2(x) f_X(x)\,dx - \frac{I_i^2}{P_i^2}.$$

If stratification is well carried out, the variance of $\theta_6$ may be less than 
the variance of the sample-mean method $\theta_4$ with $\sum_{i=1}^{m} N_i = N$. 
Once the subsets $D_i$ are selected, the next requirement is to 
define the number of samples to assign to each interval. More specifically, 
let $N_i$ be the number of samples assigned to the subset $D_i$, where 

$$\sum_{i=1}^{m} N_i = N. \tag{4.3.53}$$

The following theorem tells us how to stratify in an optimal way. 


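Theorem 4.3.2 For a given partitioning $D = \bigcup_{i=1}^{m} D_i$, 

$$\min\left\{\operatorname{var}\theta_6 = \sum_{i=1}^{m}\frac{P_i^2\sigma_i^2}{N_i}\right\} \tag{4.3.54}$$

subject to 

$$\sum_{i=1}^{m} N_i = N$$

occurs when 

$$N_i = N\,\frac{P_i\sigma_i}{\sum_{j=1}^{m} P_j\sigma_j} \tag{4.3.55}$$

and is equal to 

$$\frac{1}{N}\left[\sum_{i=1}^{m} P_i\sigma_i\right]^2. \tag{4.3.56}$$

The proof of the theorem is left to the reader. 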

Thus when the stratification regions are prescribed the minimum variance 
of $\theta_6$ occurs when the $N_i$ are proportional to $P_i\sigma_i$. 

This theorem, as well as Theorem 4.3.1, has no important direct application 
because the values of $\sigma_i$ are usually unknown. 

One practical suggestion is to make a small "pilot" run to obtain rough 
estimates for $\sigma_i$. Such estimates would be of help in determining the 
optimal $N_i$, with the appropriate trade-off between the cost of sampling 
and the degree of precision desired. 

Let us choose $N_i = P_i N$ (we assume that $P_i$ can be calculated analytically). 


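Proposition 4.3.1 $\operatorname{var}\theta_6 \le \operatorname{var}\theta_4$, that is, if the sample size $N_i$ in each 
subset $D_i$ is proportional to $P_i$ (i.e., if $N_i = NP_i$), then the variance of the 
stratified sampling method will be less than or equal to the variance of the 
sample-mean method. 

Proof Substituting $N_i = NP_i$ in (4.3.52), we obtain 

$$\operatorname{var}\theta_6 = \frac{1}{N}\sum_{i=1}^{m} P_i\operatorname{var} g(X_i). \tag{4.3.57}$$

From the Cauchy-Schwarz inequality we have 

$$I^2 = \left(\sum_{i=1}^{m} I_i\right)^2 = \left(\sum_{i=1}^{m}\frac{I_i}{\sqrt{P_i}}\sqrt{P_i}\right)^2 \le \sum_{i=1}^{m}\frac{I_i^2}{P_i}\sum_{i=1}^{m} P_i = \sum_{i=1}^{m}\frac{I_i^2}{P_i}. \tag{4.3.58}$$

Multiplying the expression for $\sigma_i^2 = \operatorname{var} g(X_i)$ by $P_i$ and summing over $i$ from 1 to $m$, we obtain 

$$\sum_{i=1}^{m} P_i\operatorname{var} g(X_i) = \int_D g^2(x) f_X(x)\,dx - \sum_{i=1}^{m}\frac{I_i^2}{P_i}, \tag{4.3.59}$$

which together with (4.3.58) can be written as 

$$\sum_{i=1}^{m} P_i\operatorname{var} g(X_i) \le \int_D g^2(x) f_X(x)\,dx - I^2 = N\operatorname{var}\theta_4. \tag{4.3.60}$$

Comparing (4.3.57) and (4.3.60), we immediately obtain the proof of this 
proposition. Q.E.D. 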


In other words, Proposition 4.3.1 states: There is no function $g(x) \in 
L^2(D, f)$ such that the stratified sampling method would be worse than the 
sample-mean method when choosing $N_i = P_i N$. Of course, if the last 
assumption is not true, the stratified sampling method may be worse than 
the sample-mean method. In exercise 6 such an example is presented. 
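
The effect described by Proposition 4.3.1 can be checked numerically. The sketch below assumes $g(x) = e^x$, a uniform p.d.f. on $[0, 1]$, and $m$ equal-width strata, so that $P_i = 1/m$ and the proportional allocation is $N_i = N/m$; these choices and the function names are made for this illustration and do not come from the text.

```python
import math
import random
import statistics

def sample_mean_estimate(g, n, rng):
    """Sample-mean estimator of the integral of g over [0, 1]."""
    return sum(g(rng.random()) for _ in range(n)) / n

def stratified_estimate(g, n, m, rng):
    """Stratified estimator with m equal strata and N_i = n // m points each."""
    per_stratum = n // m
    total = 0.0
    for i in range(m):
        # Points uniform on the stratum [i/m, (i+1)/m).
        stratum_sum = sum(g((i + rng.random()) / m) for _ in range(per_stratum))
        total += (1.0 / m) * stratum_sum / per_stratum   # (P_i / N_i) * sum of g
    return total

rng = random.Random(42)
reps, n, m = 200, 100, 10
plain = [sample_mean_estimate(math.exp, n, rng) for _ in range(reps)]
strat = [stratified_estimate(math.exp, n, m, rng) for _ in range(reps)]
var_plain = statistics.variance(plain)
var_strat = statistics.variance(strat)
print(var_plain, var_strat)
```

Both estimators use the same total number of function evaluations, so the comparison of the two empirical variances is a fair one.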

It can be proven that the efficiency of stratified sampling in comparison 
with the sample-mean method is approximately $m^2$. In the particular case 
when $P_i = 1/m$ and $N_i = N/m$, we obtain the so-called systematic sampling 
method [8]. 

The procedure for systematic sampling is as follows: 

1 Divide the range $[0, 1]$ of the cumulative distribution into $m$ intervals 
each of width $1/m$. 

2 Generate $\{U_{k_i},\ k_i = 1, \ldots, N/m;\ i = 1, \ldots, m\}$ from $\mathcal{U}(0, 1)$. 

3 $Y_{k_i} \leftarrow (i - 1 + U_{k_i})/m$; $k_i = 1, \ldots, N/m$; $i = 1, \ldots, m$. 

4 $X_{k_i} \leftarrow F^{-1}(Y_{k_i})$. 

The estimator for the integral $I$ is 

$$\frac{1}{N}\sum_{i=1}^{m}\sum_{k_i=1}^{N/m} g(X_{k_i})$$
and the sample variance is 



where $\theta_k = (1/m)\sum_{i=1}^{m} g(X_{k_i})$. 
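
The four steps of the systematic sampling procedure can be sketched as follows. The example assumes an exponential distribution with unit rate, so that $F^{-1}(y) = -\ln(1 - y)$, and takes $g(x) = x$; the function names, the seed, and the values $N = 1000$, $m = 100$ are illustrative assumptions.

```python
import math
import random

def exp_inv_cdf(y):
    """F^{-1}(y) for the unit-rate exponential; the max() guards against log(0)."""
    return -math.log(max(1.0 - y, 1e-300))

def systematic_estimate(g, inv_cdf, n, m, rng):
    """Systematic sampling with m intervals of width 1/m and n // m points each."""
    per_interval = n // m
    total = 0.0
    for i in range(1, m + 1):
        for _ in range(per_interval):
            y = (i - 1 + rng.random()) / m   # step 3: Y in the i-th interval
            total += g(inv_cdf(y))           # step 4: X = F^{-1}(Y)
    return total / (per_interval * m)

rng = random.Random(7)
estimate = systematic_estimate(lambda x: x, exp_inv_cdf, 1000, 100, rng)
print(estimate)   # estimates E[X] = 1 for the unit-rate exponential
```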


4.3.5 Antithetic Variates 

This technique is due to Hammersley and Morton [11]. In this technique 
we seek two unbiased estimators $Y'$ and $Y''$ for some unknown parameter $I$ 
(in our case $I$ is the unknown integral), having strong negative correlation. 
Note that $\tfrac{1}{2}(Y' + Y'')$ will be an unbiased estimator of $I$ with variance 

$$\operatorname{var}\left[\tfrac{1}{2}(Y' + Y'')\right] = \tfrac{1}{4}\operatorname{var} Y' + \tfrac{1}{4}\operatorname{var} Y'' + \tfrac{1}{2}\operatorname{cov}(Y', Y''), \tag{4.3.61}$$

and it follows from the last equation that, if the covariance $\operatorname{cov}(Y', Y'')$ is 
strongly negative, the method of antithetic variates can be effective in 
reducing the variance. 

As an example, consider the integral 

$$I = \int_0^1 g(x)\,dx,$$

which is equal to 

$$I = \frac{1}{2}\int_0^1 [g(x) + g(1 - x)]\,dx. \tag{4.3.62}$$

The estimator of $I$ is then 

$$Y = \tfrac{1}{2}(Y' + Y'') = \tfrac{1}{2}[g(U) + g(1 - U)]. \tag{4.3.63}$$

$Y$ is an unbiased estimator of $I$, because both $Y' = g(U)$ and $Y'' = g(1 - U)$ 
are unbiased estimators of $I$. To estimate $I$ we can take a sample of size $N$ 
from the uniform distribution and find 

$$\theta_7 = \frac{1}{2N}\sum_{i=1}^{N}[g(U_i) + g(1 - U_i)]. \tag{4.3.64}$$

The time required for one computation by (4.3.64) is twice that required 
by the sample-mean method. Therefore the estimator (4.3.64) will be more 
efficient than the estimator $\theta_2$ (4.2.25) with $a = 0$ and $b = 1$ only if 

$$\operatorname{var}\theta_7 \le \tfrac{1}{2}\operatorname{var}\theta_2.$$


Proposition 4.3.2 If $g(x)$ is a continuous monotonically nonincreasing 
(nondecreasing) function with continuous first derivatives, then 

$$\operatorname{var}\theta_7 \le \tfrac{1}{2}\operatorname{var}\theta_2. \tag{4.3.65}$$

Proof Let us assume without loss of generality that $N = 1$. It follows from 
(4.3.61) that 

$$\operatorname{var}\theta_7 = \frac{1}{4}\int_0^1 g^2(x)\,dx + \frac{1}{4}\int_0^1 g^2(1 - x)\,dx + \frac{1}{2}\int_0^1 g(x)g(1 - x)\,dx - I^2$$

$$= \frac{1}{2}\int_0^1 g^2(x)\,dx + \frac{1}{2}\int_0^1 g(x)g(1 - x)\,dx - I^2. \tag{4.3.66}$$

Therefore 

$$2\operatorname{var}\theta_7 - \operatorname{var}\theta_2 = \int_0^1 g(x)g(1 - x)\,dx - I^2.$$

The theorem will be proved if we prove 

$$\int_0^1 g(x)g(1 - x)\,dx \le I^2. \tag{4.3.67}$$

Let us assume that $g(x)$ is a monotonically nondecreasing function with 
continuous first derivatives (the proof when $g(x)$ is nonincreasing is 
similar), such that $g(1) > g(0)$. Let us introduce another auxiliary function 

$$\phi(x) = \int_0^x g(1 - t)\,dt - xI \tag{4.3.68}$$

such that $\phi(0) = \phi(1) = 0$. The first derivative 

$$\phi'(x) = g(1 - x) - I \tag{4.3.69}$$

is also a monotone function and $\phi'(0) > 0$, $\phi'(1) < 0$. Therefore $\phi(x) \ge 0$, 
$x \in [0, 1]$, and obviously 

$$\int_0^1 \phi(x)g'(x)\,dx \ge 0. \tag{4.3.70}$$

Integrating (4.3.70) by parts, we get 

$$\int_0^1 \phi'(x)g(x)\,dx \le 0, \tag{4.3.71}$$

and substituting (4.3.69) into (4.3.71), we obtain (4.3.67). Q.E.D. 


More generally, let 

$$I = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx, \qquad x \in R^1. \tag{4.3.72}$$

Then by analogy with (4.3.64) an unbiased estimator of $I$ is 

$$\theta_7 = \frac{1}{2N}\sum_{i=1}^{N}[g(X_i) + g(X_i')], \tag{4.3.73}$$

where 

$$X_i = F^{-1}(U_i) \tag{4.3.74}$$

$$X_i' = F^{-1}(1 - U_i) \tag{4.3.75}$$

and $F_X(x)$ is the cumulative distribution function (c.d.f.) of $X$. The pairs $X_i$ 
and $X_i'$ are, of course, correlated since the same random numbers $U_i$, 
$i = 1, \ldots, N$, were used to generate both r.v.'s $X_i$ and $X_i'$. Furthermore, 
these r.v.'s are negatively correlated and therefore $\theta_7$ may have a smaller 
variance than $\theta_4$. 
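
As a quick check of the antithetic estimator (4.3.64), the sketch below compares the sample variance of the pair average $\tfrac{1}{2}[g(U) + g(1 - U)]$ with that of $g(U)$ alone. The integrand $g(x) = e^x$, which is monotone, and the sample size are assumptions made for this example.

```python
import math
import random
import statistics

rng = random.Random(5)
n = 10_000
g = math.exp
us = [rng.random() for _ in range(n)]

crude = [g(u) for u in us]                       # Y' = g(U)
pairs = [0.5 * (g(u) + g(1.0 - u)) for u in us]  # Y = (Y' + Y'') / 2

theta7 = statistics.fmean(pairs)
var_crude = statistics.variance(crude)
var_pair = statistics.variance(pairs)
print(theta7, var_crude, var_pair)
```

Since each pair costs two evaluations of $g$, the fair criterion from the text is whether `var_pair` falls below half of `var_crude`.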

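Let us rewrite (4.3.51) for the case when the region $D = \{x : x \in [0, 1]\}$. 
We have 

$$\theta_6 = \sum_{i=1}^{m}\frac{\alpha_i - \alpha_{i-1}}{N_i}\sum_{k_i=1}^{N_i} g\left[\alpha_{i-1} + (\alpha_i - \alpha_{i-1})U_{k_i}\right], \tag{4.3.76}$$

where $0 = \alpha_0 < \alpha_1 < \cdots < \alpha_m = 1$, $P_i = \alpha_i - \alpha_{i-1}$, and $U_{k_i}$ is a sample from 
$\mathcal{U}(0, 1)$. Letting $m = 2$ and $N_1 = N_2 = N$, and denoting $\alpha_1 = \alpha$, we get for (4.3.76) 

$$\theta_6 = \frac{1}{N}\sum_{j=1}^{N}\left\{\alpha g(\alpha U_{j1}) + (1 - \alpha)g[\alpha + (1 - \alpha)U_{j2}]\right\}. \tag{4.3.77}$$

Let us now make $U_{j1}$ and $U_{j2}$ dependent. Assuming $U_{j1} = U_{j2} = U_j$, we obtain 

$$\theta_7' = \frac{1}{N}\sum_{j=1}^{N}\left\{\alpha g(\alpha U_j) + (1 - \alpha)g[\alpha + (1 - \alpha)U_j]\right\} \tag{4.3.78}$$

or, alternatively, assuming $U_{j1} = 1 - U_{j2} = U_j$, we have 

$$\theta_7'' = \frac{1}{N}\sum_{j=1}^{N}\left\{\alpha g(\alpha U_j) + (1 - \alpha)g[1 - (1 - \alpha)U_j]\right\}. \tag{4.3.79}$$

It is easy to see that both (4.3.78) and (4.3.79) are estimators of the antithetic variates 
type. If $\alpha = \tfrac{1}{2}$, then (4.3.79) reduces to (4.3.64). 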

Consider now a case with two strata for (4.3.72). Assume the domain of 
$f_X(x)$ is broken up by $x_\alpha$ into the ranges $-\infty < x < x_\alpha$ and $x_\alpha < x < \infty$. 
By analogy with (4.3.79) an unbiased estimator of $I$ is 

$$\theta_7 = \frac{1}{N}\sum_{i=1}^{N}[\alpha g(X_i) + (1 - \alpha)g(X_i')], \tag{4.3.80}$$

where 

$$X_i = F^{-1}(\alpha U_i) \tag{4.3.81}$$

$$X_i' = F^{-1}[\alpha + (1 - \alpha)U_i]. \tag{4.3.82}$$

In the particular case when $\alpha = \tfrac{1}{2}$ (4.3.82) reduces to (4.3.73). 

We can try to obtain an $\alpha$ that minimizes $\operatorname{var}\theta_7$ in (4.3.80). Generally, 
this problem is difficult to solve because $\operatorname{var}\theta_7$ need not be unimodal with 
respect to $\alpha$. In Chapter 7 some techniques for multiextremal optimization 
are considered. 


4.3.6 Partition of the Region 

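In this technique [21] we break region $D$ into two parts $D = D_1 \cup D_2$, 
representing the integral $I$ as 

$$I = \int_D g(x)\,dx = \int_{D_1} g(x)\,dx + \int_{D_2} g(x)\,dx. \tag{4.3.83}$$

Let us assume that the integral 

$$I_1 = \int_{D_1} g(x)\,dx \tag{4.3.84}$$

can be calculated analytically, and let us define a truncated p.d.f. 

$$h(x) = \begin{cases} \dfrac{f_X(x)}{1 - P}, & \text{if } x \in D_2 \\ 0, & \text{otherwise,} \end{cases} \tag{4.3.85}$$

where $P = \int_{D_1} f_X(x)\,dx$. 

Formula (4.3.83) can be written as 

$$I = I_1 + \int_{D_2} g(x)\,dx = I_1 + \int_{D_2}\frac{g(x)}{h(x)}h(x)\,dx = I_1 + E\left[\frac{g(X)}{h(X)}\right] = I_1 + (1 - P)E\left[\frac{g(X)}{f_X(X)}\right], \tag{4.3.86}$$

where the expectation is taken with respect to $h(x)$. An unbiased estimator of $I$ is then 

$$\eta_8 = I_1 + (1 - P)\frac{g(X)}{f_X(X)} \tag{4.3.87}$$

and the integral $I$ can be estimated by 

$$\theta_8 = I_1 + \frac{1 - P}{N}\sum_{i=1}^{N}\frac{g(X_i)}{f_X(X_i)}. \tag{4.3.88}$$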
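Proposition 4.3.3 $\operatorname{var}\theta_8 \le (1 - P)\operatorname{var}\theta_3$. 

Proof We have from (4.3.4) that 

$$N\operatorname{var}\theta_3 = \int_D\frac{g^2(x)}{f_X(x)}\,dx - I^2 = \int_{D_1}\frac{g^2(x)}{f_X(x)}\,dx + \int_{D_2}\frac{g^2(x)}{f_X(x)}\,dx - I^2 \tag{4.3.90}$$

and, correspondingly, from (4.3.88) that 

$$N\operatorname{var}\theta_8 = (1 - P)^2\left[\int_{D_2}\frac{g^2(x)}{f_X^2(x)}\frac{f_X(x)}{1 - P}\,dx - \left(\frac{1}{1 - P}\int_{D_2} g(x)\,dx\right)^2\right] = (1 - P)\int_{D_2}\frac{g^2(x)}{f_X(x)}\,dx - (I - I_1)^2. \tag{4.3.91}$$

Multiplying (4.3.90) by $(1 - P)$ and subtracting (4.3.91), we obtain 

$$N[(1 - P)\operatorname{var}\theta_3 - \operatorname{var}\theta_8] = (1 - P)\int_{D_1}\frac{g^2(x)}{f_X(x)}\,dx - (1 - P)I^2 + (I - I_1)^2. \tag{4.3.92}$$

Now introducing 

$$C^2 = \int_{D_1}\frac{g^2(x)}{f_X(x)}\,dx - \frac{I_1^2}{P}, \tag{4.3.93}$$

which is nonnegative by the Cauchy-Schwarz inequality, we have 

$$N[(1 - P)\operatorname{var}\theta_3 - \operatorname{var}\theta_8] = (1 - P)C^2 + (P^{1/2}I - P^{-1/2}I_1)^2 \ge 0,$$

and Proposition 4.3.3 is proved. Q.E.D. 

As a result of the proposition, we find that this technique is at least 
$(1 - P)^{-1}$ times more efficient than the sample-mean Monte Carlo method. 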
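
The following sketch illustrates the estimator (4.3.88) on a simple case. It assumes $g(x) = e^x$ on $D = [0, 1]$ with a uniform $f_X$, and takes $D_1 = [0, 0.5]$, so that $I_1 = e^{0.5} - 1$ is known analytically and $P = 0.5$; these choices are made for illustration only.

```python
import math
import random
import statistics

rng = random.Random(99)
n = 10_000
g = math.exp
p = 0.5                      # P = integral of f_X over D_1 = [0, 0.5]
i1 = math.exp(0.5) - 1.0     # I_1 = integral of g over D_1, known analytically

# Sample-mean terms g(X)/f_X(X) with X ~ U(0, 1), where f_X = 1.
plain = [g(rng.random()) for _ in range(n)]

# Partition terms: X ~ h(x) = f_X(x)/(1 - P), i.e. uniform on D_2 = [0.5, 1].
part = [i1 + (1.0 - p) * g(0.5 + 0.5 * rng.random()) for _ in range(n)]

theta3 = statistics.fmean(plain)
theta8 = statistics.fmean(part)
var_plain = statistics.variance(plain)
var_part = statistics.variance(part)
print(theta3, theta8, var_plain, var_part)
```

In this example the per-sample variance of the partition estimator should fall well below $(1 - P)$ times that of the sample-mean estimator, in line with Proposition 4.3.3.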
4.3.7 Reducing the Dimensionality 

This approach is due to Buslenko [21] and is sometimes called expected 
value. 

Let us assume that the integral 

$$I = \int_D g(x) f_X(x)\,dx, \qquad x \in D \subset R^n \tag{4.3.94}$$

can be represented as 

$$I = \int_{D_1}\int_{D_2} g(y, z) f(y, z)\,dy\,dz, \tag{4.3.95}$$

where 

$$y = (x_1, \ldots, x_l) \in D_1 \subset R^l$$

and 

$$z = (x_{l+1}, \ldots, x_n) \in D_2 \subset R^{n-l}.$$

Assume also that the integration with respect to $z$ can be performed 
analytically, that is, the marginal p.d.f. 

$$f_Y(y) = \int_{D_2} f_{Y,Z}(y, z)\,dz \tag{4.3.96}$$

and the conditional expectation 

$$E_Z[g(Z \mid Y)] = \int_{D_2} g(y, z) f(z \mid y)\,dz \tag{4.3.97}$$

can be found analytically. 

It is obvious that 

$$I = \int_{D_1} E_Z[g(Z \mid Y)] f_Y(y)\,dy = E_Y E_Z[g(Z \mid Y)]. \tag{4.3.98}$$

An unbiased estimator of $I$ is 

$$\eta_9 = E_Z[g(Z \mid Y)], \tag{4.3.99}$$

and it can be estimated by 

$$\theta_9 = \frac{1}{N}\sum_{i=1}^{N} E_Z[g(Z \mid Y_i)], \tag{4.3.100}$$

where $Y_i$, $i = 1, \ldots, N$, are distributed with p.d.f. $f_Y(y)$. 
Proposition 4.3.4 If integration can be performed analytically with respect 
to some variables, then the variance will be reduced, that is, 

$$\operatorname{var}\eta_9 \le \operatorname{var}\eta_4, \tag{4.3.101}$$

where $\eta_4$ is the sample-mean estimator (see (4.3.13)). 

Proof The proof is quite simple. Denote $V = g(Y, Z)$. Now using the well 
known formula [17] 

$$\operatorname{var} V = \operatorname{var}_Y\{E_Z(V \mid Y)\} + E_Y\{\operatorname{var}_Z(V \mid Y)\} \tag{4.3.102}$$

and noticing that $\eta_4 = V$, $\eta_9 = E_Z[g(Z \mid Y)] = E_Z(V \mid Y)$, and $E_Y\operatorname{var}_Z(V \mid Y) 
\ge 0$, the result follows immediately. Q.E.D. 
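
A small sketch of Proposition 4.3.4 follows. It assumes $g(y, z) = e^{yz}$ with $Y$ and $Z$ independent and uniform on $(0, 1)$, for which the inner expectation $E_Z[e^{yZ}] = (e^y - 1)/y$ is available in closed form; this test problem and the function name are illustrative assumptions.

```python
import math
import random
import statistics

def cond_expectation(y):
    """E_Z[exp(y Z)] for Z ~ U(0, 1); the limit at y = 0 is 1."""
    return math.expm1(y) / y if y > 0.0 else 1.0

rng = random.Random(11)
n = 10_000
ys = [rng.random() for _ in range(n)]
zs = [rng.random() for _ in range(n)]

crude = [math.exp(y * z) for y, z in zip(ys, zs)]   # eta_4 = g(Y, Z)
cond = [cond_expectation(y) for y in ys]            # eta_9 = E_Z[g(Z | Y)]

theta4 = statistics.fmean(crude)
theta9 = statistics.fmean(cond)
var_crude = statistics.variance(crude)
var_cond = statistics.variance(cond)
print(theta4, theta9, var_crude, var_cond)
```

The second estimator samples only $Y$, which is the sense in which the dimensionality of the problem has been reduced.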

4.3.8 Conditional Monte Carlo 

If the problem under consideration is very complex — the sample space 
is complicated, or the p.d.f. is difficult to generate from — then it may be 
possible to embed the given sample space in a much larger space in which 
the desired density function appears as a conditional probability. Simula- 
tion of the large problem can be much simpler than the original complex 
problem and, despite the added computation required to calculate the 
conditional probabilities, the gain in efficiency can be quite high. 

This technique was developed by Trotter and Tukey [24]. Our nomen- 
clature follows Hammersley and Handscomb’s book [10]. 

Consider again the problem of estimating 

$$I = \int_D g(x) f_X(x)\,dx = E[g(X)]. \tag{4.3.103}$$

Let $D$ be embedded in a product space $\Omega = D \times R$. Each point of $\Omega = D \times R$ 
can be written in the form $z = (x, y)$, where $x \in D$ and $y \in R$. Let $h(z) = 
h(x, y)$ be an arbitrary density function, let $\phi(z) = \phi(x, y)$ be an arbitrary 
real function, both defined on $\Omega$, and let 

$$\psi(x) = \int_R \phi(x, y)\,dy. \tag{4.3.104}$$

We also assume that both $h(z)$ and $\psi(x)$ are never zero. We may regard $x$ 
and $y$ as the first and second coordinates of $z$ so that $x$ is a function of $z$, 
which maps the points of $\Omega$ onto $D$. 

Let $dz$ denote the volume element swept out in $\Omega$ when $x$ and $y$ sweep 
out volume elements $dx$ and $dy$ in $D$ and $R$, respectively. The Jacobian of 
the transformation $z = (x, y)$ is 

$$J(z) = \frac{dx\,dy}{dz}. \tag{4.3.105}$$
(4.3.105) 
We define the weight function 

$$w(z) = \frac{f_X(x)\phi(z)J(z)}{\psi(x)h(z)}. \tag{4.3.106}$$

Then we have the following identity: 

$$I = \int_D g(x) f_X(x)\,dx = \int_D\frac{g(x) f_X(x)}{\psi(x)}\left[\int_R\phi(x, y)\,dy\right]dx = \int_{D \times R}\frac{g(x) f_X(x)\phi(z)}{\psi(x)h(z)}h(z)\,dx\,dy$$

$$= \int_{D \times R} g(x)w(z)h(z)\frac{dx\,dy}{J(z)} = \int_\Omega g(x)w(z)h(z)\,dz, \tag{4.3.107}$$

from which we can see that 

$$I = E[g(X)w(Z)], \tag{4.3.108}$$

where $X$ is the first coordinate of the random vector $Z$ sampled from $\Omega$ 
with p.d.f. $h(z)$. 

The unbiased estimator of $I$ is then of the form 

$$\eta_{10} = g(X)w(Z). \tag{4.3.109}$$

Both functions $\phi$ and $h$, and also the region $R$, are at our disposal; we may 
choose them to simplify the sampling procedure and to minimize the 
variance of the estimator $\eta_{10}$. 

We now consider a particular case. Let $h(z)$ be a given distribution on 
the product space $\Omega = D \times R$, and let $f_X(x) = f_X(x \mid y_0)$ be the conditional 
distribution of $h(z)$ given $y = y_0$. If we write $P(y)$ for the p.d.f. of $Y$ when 
$Z = (X, Y)$ has p.d.f. $h(z)$, we have 

$$h(z)\,dz = f_X(x \mid y)P(y)\,dx\,dy, \tag{4.3.110}$$

and comparison of (4.3.106) and (4.3.103) gives 

$$J(z) = \frac{h(z)}{f_X(x \mid y)P(y)}. \tag{4.3.111}$$

In particular 

$$J(x, y_0) = \frac{h(x, y_0)}{f_X(x)P(y_0)}. \tag{4.3.112}$$

By eliminating $f_X(x)$ from (4.3.106) and (4.3.112), we get 

$$w(z) = \frac{h(x, y_0)J(x, y)\phi(x, y)}{h(x, y)J(x, y_0)P(y_0)\psi(x)}. \tag{4.3.113}$$

This leads to the following rule. Suppose that $Z = (X, Y)$ is distributed on 
$\Omega$ with p.d.f. $h(z) = h(x, y)$; then 

$$\eta_{10} = g(X)w(Z),$$

where $w(Z)$ is given by (4.3.113), is an unbiased estimator of the conditional 
expectation of $g(X)$ given that $Y = y_0$. Note that this rule requires 
neither sampling from the possibly awkward space $D$ nor evaluation of the 
possibly complicated function $f$, and $\phi$ is available for variance reduction. 


4.3.9 Random Quadrature Method 

Ermakov [4] suggested a quite general method of Monte Carlo integra- 
tion based on orthonormal functions. We need some preliminary results 
before describing this method. 

Let $\phi_i(x)$, $i = 0, 1, \ldots, m$, be a system of orthonormal functions over the 
region $D$, that is, 

$$(\phi_i, \phi_j) = \int_D \phi_i(x)\phi_j(x)\,dx = \begin{cases} 0, & i \ne j \\ 1, & i = j \end{cases} \tag{4.3.114}$$

and let 

$$g_m(x) = \sum_{i=0}^{m} c_i\phi_i(x) \tag{4.3.115}$$

be an interpolation formula for a given function $g(x)$. The problem is to 
choose $c_i$, for a given set of points $x_i \in D$, in such a way that 

$$g_m(x_i) = g(x_i), \qquad i = 0, \ldots, m, \tag{4.3.116}$$

that is, at points $x_i$ we require coincidence in both the original $g(x)$ and 
the approximated function $g_m(x)$. To find $c_i$ we have to solve the following 
system of linear equations with respect to $c_i$: 

$$\begin{aligned} c_0\phi_0(x_0) + c_1\phi_1(x_0) + \cdots + c_m\phi_m(x_0) &= g(x_0) \\ &\ \,\vdots \\ c_0\phi_0(x_i) + c_1\phi_1(x_i) + \cdots + c_m\phi_m(x_i) &= g(x_i) \\ &\ \,\vdots \\ c_0\phi_0(x_m) + c_1\phi_1(x_m) + \cdots + c_m\phi_m(x_m) &= g(x_m). \end{aligned} \tag{4.3.117}$$

Applying, for instance, Cramer's rule, we find 

$$c_0 = \frac{w_g(x_0, x_1, \ldots, x_m)}{w(x_0, x_1, \ldots, x_m)}, \tag{4.3.118}$$

where 

$$w(x_0, \ldots, x_m) = \begin{vmatrix} \phi_0(x_0) & \phi_1(x_0) & \cdots & \phi_m(x_0) \\ \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_m(x_1) \\ \vdots & \vdots & & \vdots \\ \phi_0(x_m) & \phi_1(x_m) & \cdots & \phi_m(x_m) \end{vmatrix} \tag{4.3.119}$$

is the $(m + 1) \times (m + 1)$ determinant and $w_g(x_0, x_1, \ldots, x_m)$ is the corresponding 
determinant in which the first column vector $\phi_0(x) = \{\phi_0(x_0), 
\phi_0(x_1), \ldots, \phi_0(x_m)\}$ is replaced by the right-hand side vector $g(x) = \{g(x_0), 
g(x_1), \ldots, g(x_m)\}$. With these results at hand let us consider the problem of 
calculating the integral 

$$I_0 = \int \phi_0(x)g(x)\,dx. \tag{4.3.120}$$

Substituting (4.3.115) in the last formula, we have 

$$I_0^* = \int \phi_0(x)g_m(x)\,dx = \int \phi_0(x)\left[\sum_{i=0}^{m} c_i\phi_i(x)\right]dx \approx I_0, \tag{4.3.121}$$

which is an approximation of $I_0$ and is called an interpolation quadrature 
formula [4] for $I_0$. Taking into consideration the orthonormality condition 
(4.3.114), we immediately obtain 

$$I_0^* = c_0. \tag{4.3.122}$$

Therefore the value of integral $I_0$ is approximately equal to the coefficient 
$c_0$ in the interpolation formula (4.3.115) and can be calculated by Cramer's 
rule (4.3.118). 

Ermakov [4] suggested choosing the points $x_i \in D$ in the interpolation 
formula (4.3.115) according to some probabilistic law rather than determining 
them in advance. 

Assuming that $x_i$, $c_i$, or both of them are random variables, they called 
(4.3.115) a random quadrature formula, which is a natural generalization of 
the same formula (4.3.115) with deterministic $x_i$ and $c_i$. They proved the 
following theorem. 


Theorem 4.3.3 Let 

$$\theta_{11} = \begin{cases} \dfrac{w_g(X)}{w(X)}, & \text{if } X \in B_+ \subset R^{m+1} \\ 0, & \text{if } X \in B_0 \subset R^{m+1} \end{cases}$$

be a random variable distributed with 

$$f_X(x) = \frac{1}{(m + 1)!}\,w^2(x), \tag{4.3.123}$$

where 

$$B_0 = \{x : w(x) = 0\}$$

and 

$$B_+ = \{x : w(x) \ne 0\}. \tag{4.3.124}$$

Then $\theta_{11}$ is an unbiased estimator of $I_0$, that is, 

$$E(\theta_{11}) = I_0 \tag{4.3.125}$$

with variance 

$$\operatorname{var}\theta_{11} \le \int g^2(x)\,dx - \sum_{i=0}^{m}\left[\int \phi_i(x)g(x)\,dx\right]^2. \tag{4.3.126}$$

The proof of the theorem, as well as some generalizations and applications, 
can be found in Ermakov's monograph [4]. 

This method offers great possibilities because of its general character. 
But it also has some weak points: first, we must define a set of orthonormal 
functions over the region $D$; second, we must find an efficient way of 
sampling $X_0, X_1, \ldots, X_m$ with joint p.d.f. 

$$\frac{1}{(m + 1)!}\,[w(x_0, x_1, \ldots, x_m)]^2.$$

Even then computation of $\theta_{11}$ is generally no small matter, and therefore 
the random quadrature method seems to be of rather limited practicality. 


4.3.10 Biased Estimators 

Until now we have considered unbiased estimators for computing integrals. 
Using biased estimators, we can sometimes achieve useful results. 
Let us estimate the integral 

$$I = \int_D g(x)\,dx \tag{4.3.127}$$

by 

$$\theta_{12} = \frac{\sum_{i=1}^{N} g(U_i)}{\sum_{i=1}^{N} f_X(U_i)} \tag{4.3.128}$$

instead of using the usual sample-mean estimator 

$$\theta_3 = \frac{1}{N}\sum_{i=1}^{N}\frac{g(X_i)}{f_X(X_i)}.$$

Here $U$ is distributed uniformly in $D$, that is, 

$$f_U(u) = \begin{cases} \dfrac{1}{V}, & \text{if } u \in D \\ 0, & \text{otherwise,} \end{cases} \qquad V = \int_D dx, \tag{4.3.129}$$

and $X$ is distributed according to $f_X(x)$. 





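It is clear that $E(\theta_{12}) \ne I$, that is, $\theta_{12}$ is a biased estimator of $I$. Let us 
show that $\theta_{12}$ is consistent. To prove consistency let us represent $\theta_{12}$ as a 
ratio of two random variables $\theta_{12}'$ and $\theta_{12}''$, that is, 

$$\theta_{12} = \frac{\theta_{12}'}{\theta_{12}''}, \tag{4.3.130}$$

where 

$$\theta_{12}' = \frac{V}{N}\sum_{i=1}^{N} g(U_i), \tag{4.3.131}$$

$$\theta_{12}'' = \frac{V}{N}\sum_{i=1}^{N} f_X(U_i). \tag{4.3.132}$$

Further, 

$$E(\theta_{12}') = \int g(u)\,du = I \tag{4.3.133}$$

and 

$$E(\theta_{12}'') = \int f_X(u)\,du = 1. \tag{4.3.134}$$

With these results in hand we conclude that $\theta_{12}'$ and $\theta_{12}''$ converge a.s. to $I$ 
and 1, respectively, when $N \to \infty$, which also means that 

$$\frac{\sum_{i=1}^{N} g(U_i)}{\sum_{i=1}^{N} f_X(U_i)} \xrightarrow{\text{a.s.}} I, \qquad \text{if } \int |g(x)|\,dx < \infty \tag{4.3.135}$$

and this shows that $\theta_{12}$ is a consistent estimator of $I$. The bias of $\theta_{12}$ follows 
from 

$$E(\theta_{12}) = E\left[\frac{\sum_{i=1}^{N} g(U_i)}{\sum_{i=1}^{N} f_X(U_i)}\right] \ne \frac{E\left[\sum_{i=1}^{N} g(U_i)\right]}{E\left[\sum_{i=1}^{N} f_X(U_i)\right]} = I. \tag{4.3.136}$$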

2 M,) £• 2/(^) 
One major advantage of this method is that the sample is taken from a 
uniform distribution rather than from a general $f_X(x)$ from which the 
generation of r.v.'s can be difficult (recall for instance that in importance 
sampling $f_X(x)$ has to be proportional to $|g(x)|$, and if $g(x)$ is a complicated 
function, it is difficult to generate from $f_X(x)$). 

Powell and Swann [20] called this method weighted uniform sampling. 
They showed that for sufficiently large N this method is times more 
efficient than the sample-mean method. 
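
The ratio estimator (4.3.128) is straightforward to sketch. The example below assumes $D = [0, 1]$ (so $V = 1$), $g(x) = e^x$, and the density $f_X(x) = (1 + x)/1.5$, which is roughly proportional to $g$ on this interval; these functions and the name `weighted_uniform` are illustrative assumptions.

```python
import math
import random

def weighted_uniform(g, f, n, rng):
    """Weighted uniform sampling: sum of g(U_i) divided by sum of f(U_i)."""
    us = [rng.random() for _ in range(n)]
    return sum(g(u) for u in us) / sum(f(u) for u in us)

def f_x(x):
    """A p.d.f. on [0, 1] that roughly follows the shape of exp(x)."""
    return (1.0 + x) / 1.5

rng = random.Random(3)
theta12 = weighted_uniform(math.exp, f_x, 20_000, rng)
print(theta12)   # estimates the integral of exp over [0, 1], i.e. e - 1
```

Only uniform random numbers are needed here, even though $f_X$ enters the estimator; no sample is ever drawn from $f_X$ itself.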


4.3.11 Weighted Monte Carlo Integration 

Yakowitz et al. [27] suggested estimating the integral 

$$I = \int_0^1 g(x)\,dx$$

using the following Monte Carlo procedure: 

1 Generate $U_1, \ldots, U_N$ from $\mathcal{U}(0, 1)$. 

2 Arrange $U_1, \ldots, U_N$ in increasing order $U_{(1)}, \ldots, U_{(N)}$. 

3 Estimate the integral by 

$$\theta_{13} = \frac{1}{2}\sum_{i=0}^{N}\left[g(U_{(i)}) + g(U_{(i+1)})\right]\left(U_{(i+1)} - U_{(i)}\right), \tag{4.3.137}$$

where $U_{(0)} \equiv 0$ and $U_{(N+1)} \equiv 1$. They proved the following 

Proposition 4.3.5 Assume $g(x)$ is a function with a continuous second 
derivative on $[0, 1]$. If $\{U_{(i)}\}_{i=1}^{N}$ is the ordered sample associated with $N$ 
independent uniform observations, then 

$$\operatorname{var}\theta_{13} = E(\theta_{13} - I)^2 \le \frac{k}{N^4}, \tag{4.3.138}$$

where $k$ is some positive constant. 

It is also shown in [27] that in the one-dimensional case $\operatorname{var}\theta_{13} = 
O(1/N^4)$, which is much less than $\operatorname{var}\theta_3 = O(1/N)$ in the sample-mean 
Monte Carlo method, and in the two-dimensional case $\operatorname{var}\theta_{13} = O(1/N^2)$, 
which is bigger than $\operatorname{var}\theta_{13}$ in the one-dimensional case but less than 
$\operatorname{var}\theta_3 = O(1/N)$ for the sample-mean Monte Carlo method. Unfortunately, 
Yakowitz et al.'s method becomes inefficient as the dimensionality of $x$ 
increases. 
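
The three steps of this procedure amount to a trapezoidal rule on randomly placed, sorted nodes. A minimal sketch follows; the function name `weighted_mc`, the integrand $e^x$, and the sample size are assumptions made for this illustration.

```python
import math
import random

def weighted_mc(g, n, rng):
    """Estimator (4.3.137): trapezoids between consecutive ordered uniforms."""
    # Steps 1 and 2: generate and sort, then add U_(0) = 0 and U_(N+1) = 1.
    nodes = [0.0] + sorted(rng.random() for _ in range(n)) + [1.0]
    # Step 3: sum the trapezoid areas over adjacent pairs of nodes.
    return 0.5 * sum((g(a) + g(b)) * (b - a) for a, b in zip(nodes, nodes[1:]))

rng = random.Random(13)
theta13 = weighted_mc(math.exp, 1000, rng)
print(theta13, math.e - 1.0)
```

For a linear integrand every trapezoid is exact, so the estimator returns the integral up to rounding error regardless of where the random nodes fall.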
4.3.12 More about Variance Reduction (Queueing Systems and Networks) 

In this section we consider two more examples of application of variance 
reduction techniques, which are taken from Refs. 29, 32, and 33. The first 
example is a single server queue GI/G/1, the second, a network. Some 
other examples of variance reduction with application to different problems 
can be found in Refs. 28 through 46. 

(a) Single Server Queue GI/G/1 [46] Consider a single server queueing 
system GI/G/1, with a general distribution of service and interarrival 
time. We assume that, if an arriving customer finds the server free, his 
service commences immediately, and he departs from the system after 
completion of his service. If the arriving customer finds the server busy, he 
enters the waiting room and waits for his turn to be served. Customers are 
served on a first-in-first-out (FIFO) basis. 

Let $S_i$ denote the service time of the $i$th customer, who arrives at time $t_i$, and 
let $A_i = t_i - t_{i-1}$, $i \ge 1$, denote the interarrival time (the time between the 
arrivals of the $(i - 1)$th and $i$th customers). 

Assume that the sequences $\{S_i, i \ge 0\}$ and $\{A_i, i \ge 1\}$ each consist of 
i.i.d. r.v.'s and are themselves independent. Let $\mu$ be the mean service rate, 
and let $\lambda$ be the mean arrival rate, that is, 

$$E(S_i) = \mu^{-1} \quad \text{and} \quad E(A_i) = \lambda^{-1}.$$

The parameter $\rho = \lambda/\mu$ is called the traffic intensity and measures the 
congestion of the queueing system. The necessary and sufficient condition 
for the system to reach steady-state position (to become stable) is $\rho < 1$. 

To measure the performance of the system we can use the mean waiting 
time of the $i$th customer (time from arrival to commencement of service); the 
number of customers in the system at time $t$; the amount of time in the 
interval $[0, t]$ that the server is busy; or the total number of customers who 
have been served in the interval $[0, t]$. As our measure of performance we 
take the mean waiting time of the $i$th customer and denote it by $E(W_i)$. 

We assume that customer 0 arrives at time $t_0 = 0$ and finds an empty 
system. The following recursive formula is well known [33]: 

$$W_0 = 0$$

$$W_i = \max(W_{i-1} - A_i + S_{i-1}, 0) = (W_{i-1} - A_i + S_{i-1})^+, \qquad i = 1, 2, \ldots \tag{4.3.139}$$

Usually, for the GI/G/1 queueing system it is difficult to find $E(W_i)$
analytically and simulation may be used. In order to estimate $E(W_i)$ we
run the queueing system N times, each time starting from $t_0 = 0$, obtain a
sequence of service times $\{S_{ik}, i \ge 0, k = 1, \ldots, N\}$ and a sequence of
interarrival times $\{A_{ik}, i \ge 1, k = 1, \ldots, N\}$, and estimate $E(W_i)$ by the
sample-mean formula

$$\bar W_i = \frac{1}{N}\sum_{k=1}^{N} W_{ik}, \tag{4.3.140}$$

where $W_{ik} = (W_{(i-1)k} - A_{ik} + S_{(i-1)k})^+$, $W_{0k} = 0$.
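The recursion (4.3.139) and the sample-mean formula (4.3.140) can be sketched in a few lines. The function names and the M/M/1 rates below are illustrative assumptions, not part of the text; for these rates the steady-state mean wait in queue is $\lambda/(\mu(\mu - \lambda)) = 1$, which gives a rough check on the output.

```python
import random


def waiting_time(i, draw_interarrival, draw_service, rng):
    """Return W_i from W_m = max(W_{m-1} - A_m + S_{m-1}, 0) with W_0 = 0."""
    w = 0.0
    for _ in range(i):
        s_prev = draw_service(rng)    # S_{m-1}, service time of the previous customer
        a = draw_interarrival(rng)    # A_m, time between customers m-1 and m
        w = max(w - a + s_prev, 0.0)
    return w


def sample_mean_waiting_time(i, n_runs, draw_interarrival, draw_service, seed=0):
    """Average W_i over n_runs independent runs, each started from an empty system."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_runs):
        total += waiting_time(i, draw_interarrival, draw_service, rng)
    return total / n_runs


# Illustrative M/M/1 case (assumed rates): arrival rate 0.5, service rate 1.0.
estimate = sample_mean_waiting_time(
    i=50,
    n_runs=2000,
    draw_interarrival=lambda rng: rng.expovariate(0.5),
    draw_service=lambda rng: rng.expovariate(1.0),
)
print(estimate)
```

Passing the two distributions as callables keeps the recursion itself independent of the particular GI/G/1 model being simulated.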

We now explain how the antithetic and control variates methods can be 
applied for variance reduction, thereby improving the efficiency of the 
simulation. Both methods are based on reuse of the same random num- 
bers. 

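Antithetic variates. Let $F_1(x)$ be the c.d.f. of the interarrival time $A_i$ and
let $F_2(x)$ be the c.d.f. of the service time $S_i$. Let us generate two
sequences of random numbers $\{U_{ik}^{(1)}, i \ge 0, k = 1, \ldots, N\}$ and
$\{U_{ik}^{(2)}, i \ge 0, k = 1, \ldots, N\}$, and obtain two corresponding sequences
$A_{ik} = F_1^{-1}(U_{ik}^{(1)})$ and $S_{ik} = F_2^{-1}(U_{ik}^{(2)})$ of interarrival and service
times. Introducing the antithetic sequences $\{1 - U_{ik}^{(1)}, i \ge 0, k = 1, \ldots, N\}$ and
$\{1 - U_{ik}^{(2)}, i \ge 0, k = 1, \ldots, N\}$, we can define another two sequences
$A_{ik}' = F_1^{-1}(1 - U_{ik}^{(1)})$ and $S_{ik}' = F_2^{-1}(1 - U_{ik}^{(2)})$ of interarrival and
service times and estimate the mean waiting time $E(W_i)$ by

$$\bar W_i^{(A)} = \frac{1}{N}\sum_{k=1}^{N}\frac{W_{ik} + W_{ik}'}{2}, \tag{4.3.141}$$

where

$$W_{ik} = (W_{(i-1)k} - A_{ik} + S_{(i-1)k})^+ = \left[W_{(i-1)k} - F_1^{-1}(U_{ik}^{(1)}) + F_2^{-1}(U_{(i-1)k}^{(2)})\right]^+,$$

$$W_{ik}' = \left[W_{(i-1)k}' - F_1^{-1}(1 - U_{ik}^{(1)}) + F_2^{-1}(1 - U_{(i-1)k}^{(2)})\right]^+.$$

Now

$$\operatorname{var}\bar W_i^{(A)} = \frac{1}{4N}\left[\operatorname{var}W_i + \operatorname{var}W_i' + 2\operatorname{cov}(W_i, W_i')\right]
= \frac{1}{2N}\left[\operatorname{var}W_i + \operatorname{cov}(W_i, W_i')\right]. \tag{4.3.142}$$

By analogy with (4.3.65) we can conclude that the method of antithetic
variates will be more efficient than the sample-mean method if

$$\operatorname{var}\bar W_i^{(A)} \le \tfrac{1}{2}\operatorname{var}\bar W_i, \tag{4.3.143}$$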

which means that cov(M^, W') is negative and |cov(W^, H^/)| >^var W r 
Page [46] suggested estimating $E(W_i)$ by

$$\frac{1}{N}\sum_{k=1}^{N}\frac{W_{ik} + W_{ik}''}{2}, \tag{4.3.144}$$

where $W_{ik}'' = \left[W_{(i-1)k}'' - F_1^{-1}(U_{ik}^{(2)}) + F_2^{-1}(U_{(i-1)k}^{(1)})\right]^+$.

Comparing the estimates (4.3.141) and (4.3.144), we can see that the antithetic
pairs $1 - U_{ik}^{(1)}$ and $1 - U_{(i-1)k}^{(2)}$ in (4.3.141) were replaced, correspondingly, by
$U_{ik}^{(2)}$ and $U_{(i-1)k}^{(1)}$ in (4.3.144).

Mitchell [45] proved that, for any i > 0, both estimators (4.3.141) and (4.3.144)
are more efficient than the sample-mean estimator.
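The antithetic construction behind (4.3.141) can be sketched as follows, assuming exponential interarrival and service times so that the inverse c.d.f.'s are available in closed form. The rates, seed, and helper names are illustrative assumptions. The sample covariance of $(W_i, W_i')$ is printed because (4.3.142) shows that a negative covariance is what produces the variance reduction.

```python
import math
import random


def open_uniform(rng):
    """Draw U in the open interval (0, 1) so that both U and 1 - U can be inverted."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def inv_exp(u, rate):
    """Inverse c.d.f. of the exponential distribution with the given rate."""
    return -math.log(1.0 - u) / rate


def lindley(u_arrival, u_service, arrival_rate, service_rate):
    """W_i driven by uniforms: u_arrival[m-1] gives A_m, u_service[m-1] gives S_{m-1}."""
    w = 0.0
    for ua, us in zip(u_arrival, u_service):
        w = max(w - inv_exp(ua, arrival_rate) + inv_exp(us, service_rate), 0.0)
    return w


def antithetic_pairs(i, n_pairs, arrival_rate, service_rate, seed=1):
    """Return n_pairs of (W_i, W_i') built from the streams U and 1 - U."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        ua = [open_uniform(rng) for _ in range(i)]
        us = [open_uniform(rng) for _ in range(i)]
        w = lindley(ua, us, arrival_rate, service_rate)
        w_anti = lindley([1.0 - u for u in ua], [1.0 - u for u in us],
                         arrival_rate, service_rate)
        pairs.append((w, w_anti))
    return pairs


pairs = antithetic_pairs(i=50, n_pairs=2000, arrival_rate=0.5, service_rate=1.0)
n = len(pairs)
mean_w = sum(w for w, _ in pairs) / n
mean_w_anti = sum(wa for _, wa in pairs) / n
estimate = 0.5 * (mean_w + mean_w_anti)          # estimator of the form (4.3.141)
sample_cov = sum((w - mean_w) * (wa - mean_w_anti) for w, wa in pairs) / (n - 1)
print(estimate, sample_cov)
```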


Control variates. It is suggested in Ref. 33 that

$$C_i = C_{i-1} - A_i + S_{i-1}, \qquad C_0 = 0 \tag{4.3.145}$$

be chosen as a control variate for $W_i = \max(W_{i-1} - A_i + S_{i-1}, 0)$, $W_0 = 0$.

Table 4.3.1 presents $\operatorname{var}(\bar W_i)$ for different methods and for the 200th
customer, based on 25 runs.

The service time has an exponential distribution with mean $\mu^{-1} = 1.111$;
the interarrival time is assumed to be constant and equal to unity, and at
time $t_0 = 0$ there are no customers in the system. We can see that the effect
of variance reduction by the antithetic and control variates is substantial.
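A minimal sketch of the control-variate idea for this example follows. It assumes that the control variate is the unreflected random walk $C_i = C_{i-1} - A_i + S_{i-1}$, whose mean $i\,(E S - E A)$ is known, and it uses $\beta = 1$ as in Table 4.3.1. The number of runs and the seed are arbitrary choices, so the printed variances are not meant to reproduce the table.

```python
import random


def run(i, mean_service, interarrival, rng):
    """Return (W_i, C_i) with C_m = C_{m-1} - A_m + S_{m-1} and C_0 = 0 (assumed form)."""
    w = c = 0.0
    for _ in range(i):
        step = rng.expovariate(1.0 / mean_service) - interarrival   # S_{m-1} - A_m
        w = max(w + step, 0.0)
        c = c + step
    return w, c


def variance(values):
    """Unbiased sample variance."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


i, mean_service, interarrival, beta = 200, 1.111, 1.0, 1.0
rng = random.Random(2)
runs = [run(i, mean_service, interarrival, rng) for _ in range(200)]
expected_c = i * (mean_service - interarrival)          # E(C_i), known analytically
raw = [w for w, _ in runs]
controlled = [w - beta * (c - expected_c) for w, c in runs]
print(sum(raw) / len(raw), sum(controlled) / len(controlled))
print(variance(raw), variance(controlled))
```

Both averages estimate the same quantity $E(W_{200})$; only the spread of the individual terms differs.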


(b) Networks

(i) Antithetic variates

To illustrate the use of antithetic variates for networks, consider the
network shown in Fig. 4.3.1.

Suppose we wish to estimate the expected completion time E(T), where
$T = T_1 + T_2$, by simulation, assuming that $T_1$ and $T_2$ are independent.

The procedure of using antithetic variates for estimating E(T) is
straightforward and can be written as:

1 Generate two sequences of random numbers $\{U_i^{(1)}, i = 1, \ldots, N\}$ and
$\{U_i^{(2)}, i = 1, \ldots, N\}$.


Table 4.3.1 $\operatorname{var}(\bar W_i)$ for Different Methods

Method                            Sample-Mean    Antithetic Variates    Control Variates (β = 1)
$\operatorname{var}(\bar W_i)$    10.678         1.770                  1.427

Source: Data from Ref. 33.

Fig. 4.3.1 Network (from Ref. 29).


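2 Compute $T_{1i} = F_1^{-1}(U_i^{(1)})$, $T_{2i} = F_2^{-1}(U_i^{(2)})$, $T_{1i}' = F_1^{-1}(1 - U_i^{(1)})$, and
$T_{2i}' = F_2^{-1}(1 - U_i^{(2)})$.

3 Estimate E(T) by

$$\bar T^{(A)} = \frac{1}{N}\sum_{i=1}^{N}\frac{(T_{1i} + T_{2i}) + (T_{1i}' + T_{2i}')}{2}.$$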


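Let us assume that both $T_1$ and $T_2$ are distributed exp(1). Then denoting
$T_i = T_{1i} + T_{2i}$ and $T_i' = T_{1i}' + T_{2i}'$, we obtain

$$\begin{aligned}
\operatorname{var}(\bar T^{(A)}) &= \frac{1}{4N}\operatorname{var}\left[(T_{1i} + T_{1i}') + (T_{2i} + T_{2i}')\right] \\
&= \frac{1}{4N}\left[\operatorname{var}T_{1i} + \operatorname{var}T_{1i}' + \operatorname{var}T_{2i} + \operatorname{var}T_{2i}' + 2\operatorname{cov}(T_{1i}, T_{1i}') + 2\operatorname{cov}(T_{2i}, T_{2i}')\right] \\
&= \frac{1}{4N}\left\{4 + 2E[(T_{1i} - 1)(T_{1i}' - 1)] + 2E[(T_{2i} - 1)(T_{2i}' - 1)]\right\} \\
&= \frac{1}{4N}\left\{4 + 2E[(\ln U^{(1)} + 1)(\ln(1 - U^{(1)}) + 1)] + 2E[(\ln U^{(2)} + 1)(\ln(1 - U^{(2)}) + 1)]\right\} \\
&= \frac{1}{N}\left(2 - \frac{\pi^2}{6}\right) \approx \frac{0.355}{N}.
\end{aligned} \tag{4.3.146}$$

The last step uses $\int_0^1 \ln u \,\ln(1 - u)\,du = 2 - \pi^2/6$, so that
$E[(\ln U + 1)(\ln(1 - U) + 1)] = 1 - \pi^2/6 \approx -0.645$.

On the other hand, in the sample-mean method with 2N runs we have

$$\operatorname{var}(\bar T) = \frac{1}{N}. \tag{4.3.147}$$

Thus the variance has been reduced to about one third of its sample-mean value.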

It can be proven by analogy with Proposition 4.3.3 that for any continuous
r.v.'s $T_1$ and $T_2$ the method of antithetic variates is more efficient than
the sample-mean method.

This simple example has been chosen solely to simplify the presentation.
The method of antithetic variates can be successfully employed for any
more complex network.
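The exp(1) example can be checked numerically. The sketch below assumes the inverse transform $T = -\ln U$ with antithetic partner $T' = -\ln(1 - U)$, and compares the sample variance of the pair averages $(T_i + T_i')/2$ with the closed form $2 - \pi^2/6$ that follows from $\int_0^1 \ln u \ln(1-u)\,du = 2 - \pi^2/6$. The sample size and seed are arbitrary.

```python
import math
import random
import statistics


def open_uniform(rng):
    """Draw U in the open interval (0, 1)."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def antithetic_values(n_pairs, seed=3):
    """Return the pair averages (T_i + T_i') / 2 for T = T_1 + T_2, T_1, T_2 ~ exp(1)."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_pairs):
        u1, u2 = open_uniform(rng), open_uniform(rng)
        t = -math.log(u1) - math.log(u2)                   # T_i  = T_1i + T_2i
        t_anti = -math.log(1.0 - u1) - math.log(1.0 - u2)  # T_i' = T_1i' + T_2i'
        values.append(0.5 * (t + t_anti))
    return values


values = antithetic_values(20000)
per_pair_variance = statistics.variance(values)
theory = 2.0 - math.pi ** 2 / 6.0      # closed form of (1/4) var(T_i + T_i')
print(statistics.mean(values), per_pair_variance, theory)
```

Two independent runs averaged together would have variance 1 per pair, so the per-pair figure printed here is directly comparable with the sample-mean method at equal cost.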


Control variates. Consider the network shown in Fig. 4.3.2. We are
interested in finding $E(T_{AB})$, the mean completion time of the network. We
assume that all $T_i$, i = 1, ..., 10, are independent exponentially distributed
r.v.'s with the same mean, equal to 10. Even in this case it is difficult to
calculate $E(T_{AB})$ because of the "crossing" link of duration $T_{10}$. It is
suggested in Ref. 29 that the control network be chosen as a subnetwork
of the original complex network, formed by deleting links with low
probabilities of falling within the critical path. Two such control networks
are shown in Figs. 4.3.3 and 4.3.4: the upper and lower control networks,
respectively.

For these two control networks the mean completion times are available
analytically. Table 4.3.2 presents simulation results for the expected value
and the variance of the completion time for the network in Fig. 4.3.2. The
following methods are considered: sample-mean, antithetic variates, and
control variates, using both the upper and the lower control networks. The
simulation results are based on 50 runs. It is clear that the degree of
variance reduction depends on our skill in selecting the control networks,
which is not an easy problem.

Table 4.3.2 Simulation Results for the Network in Fig. 4.3.2

                                                                   Control Variates
Method                       Sample Mean   Antithetic Variates   Upper Network   Lower Network
Expected value $E(T_{AB})$   55.1          54.1                  54.3            53.8
Variance var($T_{AB}$)       6.2                                 3.8             3.1

Source: Data from Ref. 29.


EXERCISES 


1 Apply Chebyshev’s rule to find the minimal sample size N for which the 
following formula will hold: 

where 


l N 
i ■" 1 


and AT — 


2 Assuming that for sufficiently large N 

- — yv(0, i). 


find the confidence interval for I with the level of significance a. 

3 Prove Theorem 4.3.2. Hint: apply Bellman's dynamic programming recursive
equation, or Lagrangian multipliers.

4 Let $I = \sum_{i=1}^{n} a_i I_i$, where $I_i = \int g_i(x)\,dx$ and $a_i$ are known coefficients. An
unbiased estimator of I is

$$\eta = \sum_{i=1}^{n} a_i \frac{g_i(X)}{f_X(X)},$$

where $f_X(x)$ is a multidimensional distribution.

(a) Prove that $\min_{f_X} \operatorname{var}(\eta)$ is achieved when $f_X(x) = |Q(x)| / \int |Q(x)|\,dx$,
where $Q(x) = \sum_{i=1}^{n} a_i g_i(x)$, and is equal to

$$\operatorname{var}\eta = \left[\int |Q(x)|\,dx\right]^2 - I^2 = \left[\int |Q(x)|\,dx\right]^2 - \left[\int Q(x)\,dx\right]^2.$$

(b) Prove that, if X {y ...,X„ are independent, then min^ (jl) var (tj) is achieved 
when 




and is equal to 


where 


From Evans [5]. 

5 Consider the integral 


/*(*) £«?*,’<*)) 


1/2 


f 2 -2°W 




I - ( h g{x)fxU Odx, 

J a 


which can be estimated by both the sample-mean Monte Carlo method 

l-l 

and by the antithetic variates method

I * 

2 + J rfb + a-Xt)). 

where the sample $X_i$, i = 1, ..., N, is taken from $\mathcal{U}(a, b)$. By the assumptions of
Proposition 4.3.2 prove

$$\operatorname{var}\theta_7 \le \tfrac{1}{2}\operatorname{var}\theta_4.$$


6 Let m * 2, .V, » /V 2 ** N/2, in the stratified sampling method. According to 
Proposition 4,3.1, P, and P 2 =\ Prove that if we choose P x and P 2 then 
for any g( x) 6 Z, 2 (A\/), var0 6 > var0 4 , that is, the stratified sampling method is 
worse than the sample mean method. From Ermakov [4]. 

7 Prove by induction on m that 

/ • • * / x m )dx 0 ,dx 1 )!, 

m+ I 

where is defined in (4.3.119). From Sobol [22). 
8 Find an estimator for



assuming that the sample is taken from the exponential distribution /*(*) — \e ~ Ax , 
X ^ 0. Prove that, for £(jt) = cjC c > I. The minimum variance of the estimator will be 
achieved when X = kl(n + l). From Sobol [22). 

9 Let U be a random number and let $X = aU + b$ and $X' = a(1 - U) + b$. Show
that the correlation coefficient between X and X' is equal to -1.

10 Consider the following network: 

Assume that $T_i$, i = 1, 2, 3, are i.i.d. r.v.'s distributed $F_T(t)$. Write two formulas for
estimating the expected completion time $E(T_{AB})$, using the following methods:

(a) Sample-mean Monte Carlo method. 

(b) Antithetic variates. 




11 Prove that while integrating in a situation of noise (see Section 4.2.4) both $\theta_1$ and
$\theta_2$ converge a.s. and in mean square to $I = \int g(x)\,dx$ and that $\operatorname{var}\theta_2 < \operatorname{var}\theta_1$.

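12 Let $I = \int g(x)h(x)\,dx = E(g(X))$, where h(x) is a p.d.f. Let $f_X(x)$ be another
p.d.f. An unbiased estimator of I is

$$\eta = \frac{1}{N}\sum_{i=1}^{N}\frac{g(X_i)h(X_i)}{f_X(X_i)}.$$

Prove that $\min_{f_X}\operatorname{var}(\eta)$ is achieved when

$$f_X(x) = \frac{|g(x)|h(x)}{\int |g(x)|h(x)\,dx}$$

and is equal to

$$\operatorname{var}\eta = \frac{1}{N}\left\{\left[\int |g(x)|h(x)\,dx\right]^2 - I^2\right\}.$$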


13 Show that the method of antithetic variates is a particular case of the method of
control variates.

CHAPTER 5

Linear Equations and 
Markov Chains 


In this chapter we show how Monte Carlo methods can be used to solve 
linear algebraic, integral, and differential equations. As a rule Monte Carlo 
methods are not competitive with classical numerical methods for solving 
systems of linear equations (some special cases where Monte Carlo meth- 
ods can be used are considered at the end of Section 5.1.3). We discuss the 
Monte Carlo methods, however, because they serve to introduce analogous 
Monte Carlo methods for solving integral equations. These methods are 
widely used, since numerical methods are not efficient in this latter case. 

This chapter is constructed as follows: In Section 5.1 we solve a system
of linear equations and find the elements of the inverse matrix in the
system by simulating discrete-time Markov chains. The problem of finding
a solution of integral equations by simulating continuous-time Markov
chains is the subject of Section 5.2. Finally, in Section 5.3 we construct a
Markov chain for solving the Dirichlet problem.


5.1 SIMULTANEOUS LINEAR EQUATIONS AND ERGODIC MARKOV 
CHAINS 

A Monte Carlo solution to a system of linear equations is based on one 
proposed by von Neumann and Ulam and extended by Forsythe and 
Leibler [4]. 

Let us consider a system of simultaneous linear equations written in 
vector form 

$$Bx = f, \tag{5.1.1}$$

where the vector $x' = (x_1, \ldots, x_n)$ is to be found and the matrix $B = \|b_{ij}\|_1^n$
and the vector $f' = (f_1, \ldots, f_n)$ are given; the prime denotes the transpose operation.

Introducing $A = I - B$, where I is an identity matrix, system (5.1.1) can
be rewritten as

$$x = Ax + f. \tag{5.1.2}$$

Suppose

$$\max_{i}\sum_{j=1}^{n}|a_{ij}| < 1. \tag{5.1.3}$$

Under this assumption we can solve (5.1.2) by applying the following
recursive equation:

$$x^{(k+1)} = Ax^{(k)} + f. \tag{5.1.4}$$

Assuming $x^{(0)} = 0$ and $A^0 = I$, we have

$$x^{(k+1)} = (I + A + \cdots + A^{k-1} + A^k)f = \sum_{m=0}^{k} A^m f. \tag{5.1.5}$$

Taking the limit, for B nonsingular,

$$\lim_{k\to\infty} x^{(k)} = \lim_{k\to\infty}\sum_{m=0}^{k} A^m f = (I - A)^{-1}f = B^{-1}f = x, \tag{5.1.6}$$

we obtain the exact solution x. The jth coordinate of the vector $x^{(k+1)}$ is
equal to

$$x_j^{(k+1)} = f_j + \sum_{i_1} a_{j i_1} f_{i_1} + \sum_{i_1}\sum_{i_2} a_{j i_1}a_{i_1 i_2} f_{i_2} + \cdots + \sum_{i_1}\cdots\sum_{i_k} a_{j i_1}\cdots a_{i_{k-1} i_k} f_{i_k}. \tag{5.1.7}$$
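As a deterministic reference point for the Monte Carlo estimators that follow, the recursion (5.1.4) can be written directly. The 2 x 2 matrix A and vector f below are illustrative values chosen to satisfy (5.1.3); they are not taken from the text.

```python
def iterate_solution(A, f, k):
    """Return x^(k+1) = sum_{m=0}^{k} A^m f, starting from x^(0) = 0 as in (5.1.4)."""
    n = len(f)
    x = [0.0] * n
    for _ in range(k + 1):
        x = [sum(A[i][j] * x[j] for j in range(n)) + f[i] for i in range(n)]
    return x


# Illustrative data (assumed): max row sum of |a_ij| is 0.4 < 1, so (5.1.3) holds.
A = [[0.1, 0.2],
     [0.3, 0.1]]
f = [1.0, 2.0]

x = iterate_solution(A, f, k=60)
print(x)
```

For this A and f the exact solution of $(I - A)x = f$ is $x = (1.3/0.75,\ 2.1/0.75)$, which the iteration approaches geometrically.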


We also consider the problem of finding the inner product

$$\langle h, x\rangle = h_1x_1 + \cdots + h_nx_n, \tag{5.1.8}$$

where h is a given vector and x is a solution of (5.1.2).

It is readily seen that by setting

$$h' = e_j' = (0, \ldots, 0, 1, 0, \ldots, 0), \tag{5.1.9}$$

with the 1 in the jth position, we obtain $\langle h, x\rangle = x_j$.

In order to solve (5.1.2) let us introduce an arbitrary ergodic Markov
chain (M.C.)

$$P = \|P_{ij}\|_1^n, \tag{5.1.10}$$

$$\sum_{i=1}^{n} p_i = 1, \qquad \sum_{j=1}^{n} P_{ij} = 1, \qquad p_i \ge 0, \quad P_{ij} \ge 0, \quad i, j = 1, \ldots, n,$$

such that*

1 $p_i > 0$, if $h_i \ne 0$

2 $P_{ij} > 0$, if $a_{ij} \ne 0$, i, j = 1, ..., n, (5.1.11)

where $p_i$ and $P_{ij}$ are, respectively, the initial distribution and the transition
probabilities of the Markov chain.

We first consider the problem of estimating $\langle h, x^{(k+1)}\rangle$, which
approximates $\langle h, x\rangle$. Let k be a given integer and let us simulate the Markov
chain (5.1.10), (5.1.11) k units of time. We associate with the Markov chain
a particle that passes through the sequence of states $i_0, i_1, \ldots, i_k$.

Define

$$W_m = \frac{a_{i_0 i_1}a_{i_1 i_2}\cdots a_{i_{m-1} i_m}}{P_{i_0 i_1}P_{i_1 i_2}\cdots P_{i_{m-1} i_m}}, \tag{5.1.12}$$

which can be written recursively

$$W_m = W_{m-1}\frac{a_{i_{m-1} i_m}}{P_{i_{m-1} i_m}}, \qquad W_0 = 1. \tag{5.1.13}$$

We also define the random variable (r.v.)

$$\eta_k(h) = \frac{h_{i_0}}{p_{i_0}}\sum_{m=0}^{k} W_m f_{i_m} \tag{5.1.14}$$

associated with the sample path $i_0 \to i_1 \to \cdots \to i_k$, which has probability
$p_{i_0}P_{i_0 i_1}P_{i_1 i_2}\cdots P_{i_{k-1} i_k}$. Now we are able to prove the following


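Proposition 5.1.1

$$E[\eta_k(h)] = \left\langle h, \sum_{m=0}^{k} A^m f\right\rangle = \langle h, x^{(k+1)}\rangle, \tag{5.1.15}$$

that is, $\eta_k(h)$ is an unbiased estimator of the inner product $\langle h, x^{(k+1)}\rangle$.

*The Markov chain need not be homogeneous; we are considering the homogeneous case for
simplicity only.

Proof Each path $i_0 \to i_1 \to \cdots \to i_k$ will be realized with probability

$$P(i_0, \ldots, i_k) = p_{i_0}P_{i_0 i_1}P_{i_1 i_2}\cdots P_{i_{k-1} i_k}. \tag{5.1.16}$$

While simulating the M.C. (5.1.10)-(5.1.11), since the r.v. $\eta_k(h)$ is defined
along the path $i_0 \to i_1 \to \cdots \to i_k$, we have

$$E[\eta_k(h)] = \sum_{i_0=1}^{n}\cdots\sum_{i_k=1}^{n}\eta_k(h)\,P(i_0, \ldots, i_k), \tag{5.1.17}$$

which, together with (5.1.12) through (5.1.14), gives

$$\begin{aligned}
E[\eta_k(h)] &= \sum_{i_0=1}^{n}\cdots\sum_{i_k=1}^{n}\frac{h_{i_0}}{p_{i_0}}\sum_{m=0}^{k} W_m f_{i_m}\, p_{i_0}P_{i_0 i_1}\cdots P_{i_{k-1} i_k} \\
&= \sum_{i_0=1}^{n}\cdots\sum_{i_k=1}^{n} h_{i_0}\sum_{m=0}^{k} a_{i_0 i_1}\cdots a_{i_{m-1} i_m} f_{i_m} P_{i_m i_{m+1}}\cdots P_{i_{k-1} i_k}.
\end{aligned} \tag{5.1.18}$$

Using the property $\sum_{j=1}^{n} P_{ij} = 1$, the last formula can be written as

$$E[\eta_k(h)] = \sum_{m=0}^{k}\sum_{i_0=1}^{n}\cdots\sum_{i_m=1}^{n} h_{i_0}a_{i_0 i_1}\cdots a_{i_{m-1} i_m} f_{i_m}. \tag{5.1.19}$$

Taking into account that

$$A^m = \|a_{i_0 i_m}^{(m)}\|_1^n, \qquad a_{i_0 i_m}^{(m)} = \sum_{i_1=1}^{n}\cdots\sum_{i_{m-1}=1}^{n} a_{i_0 i_1}a_{i_1 i_2}\cdots a_{i_{m-1} i_m},$$

and

$$\sum_{i_0=1}^{n}\sum_{i_m=1}^{n} h_{i_0}a_{i_0 i_m}^{(m)} f_{i_m} = \langle h, A^m f\rangle,$$

we immediately obtain

$$E[\eta_k(h)] = \left\langle h, \sum_{m=0}^{k} A^m f\right\rangle = \langle h, x^{(k+1)}\rangle.$$

Q.E.D.

To estimate $\langle h, x^{(k+1)}\rangle$ we simulate N random paths
$i_0^{(s)} \to i_1^{(s)} \to \cdots \to i_k^{(s)}$, s = 1, 2, ..., N, of length k each and then find the
sample mean of $\eta_k(h)$ over these paths.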
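The Procedure for Estimating $\langle h, x^{(k+1)}\rangle$

1 Choose any integer k > 0.

2 Simulate N independent random paths $i_0^{(s)} \to i_1^{(s)} \to \cdots \to i_k^{(s)}$,
s = 1, ..., N, of the Markov chain (5.1.10)-(5.1.11).

3 Find

$$\eta_k^{(s)}(h) = \frac{h_{i_0^{(s)}}}{p_{i_0^{(s)}}}\sum_{m=0}^{k} W_m^{(s)} f_{i_m^{(s)}}, \qquad s = 1, \ldots, N, \tag{5.1.21}$$

where

$$W_m^{(s)} = \frac{a_{i_0^{(s)} i_1^{(s)}}\cdots a_{i_{m-1}^{(s)} i_m^{(s)}}}{P_{i_0^{(s)} i_1^{(s)}}\cdots P_{i_{m-1}^{(s)} i_m^{(s)}}}. \tag{5.1.22}$$

4 Calculate

$$\theta_k = \frac{1}{N}\sum_{s=1}^{N}\eta_k^{(s)}(h), \tag{5.1.23}$$

which is an unbiased estimator of the inner product $\langle h, x^{(k+1)}\rangle$.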

Taking the limit of (5.1.15), we obtain

$$\lim_{k\to\infty} E[\eta_k(h)] = E[\eta_\infty(h)] = \langle h, x\rangle. \tag{5.1.24}$$

Thus provided that the von Neumann series $A + A^2 + \cdots$ converges and
the path $i_0 \to i_1 \to \cdots \to i_k \to \cdots$ is infinitely long, we obtain an unbiased
estimator of $\langle h, x\rangle$.
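The estimator (5.1.14) and its sample mean can be sketched as below. The matrix A, the vectors f and h, the uniform chain P, and the initial distribution p are illustrative assumptions chosen to satisfy (5.1.3) and (5.1.11); states are numbered from 0 for convenience. For these data the exact value is $\langle h, x\rangle = 3.4/0.75 \approx 4.533$.

```python
import random


def eta_k(A, f, h, p, P, k, rng):
    """One realization of eta_k(h) = (h_{i0}/p_{i0}) * sum_{m=0}^{k} W_m f_{i_m}."""
    n = len(f)
    i0 = rng.choices(range(n), weights=p)[0]        # initial state drawn from p
    state, weight, total = i0, 1.0, f[i0]           # W_0 = 1
    for _ in range(k):
        nxt = rng.choices(range(n), weights=P[state])[0]
        weight *= A[state][nxt] / P[state][nxt]     # W_m = W_{m-1} a / P
        total += weight * f[nxt]
        state = nxt
    return h[i0] / p[i0] * total


def estimate_inner_product(A, f, h, p, P, k, n_paths, seed=5):
    """Sample mean of eta_k(h) over n_paths independent paths."""
    rng = random.Random(seed)
    return sum(eta_k(A, f, h, p, P, k, rng) for _ in range(n_paths)) / n_paths


# Illustrative data (assumed, not from the text).
A = [[0.1, 0.2], [0.3, 0.1]]
f = [1.0, 2.0]
h = [1.0, 1.0]
p = [0.5, 0.5]
P = [[0.5, 0.5], [0.5, 0.5]]

theta = estimate_inner_product(A, f, h, p, P, k=20, n_paths=20000)
print(theta)
```

The same simulated paths could be reused for other choices of h and f, since only the weights applied along the path change.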

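The sample mean is then of the form

$$\theta_\infty = \frac{1}{N}\sum_{s=1}^{N}\eta_\infty^{(s)}(h), \tag{5.1.25}$$

where

$$\eta_\infty^{(s)}(h) = \frac{h_{i_0^{(s)}}}{p_{i_0^{(s)}}}\sum_{m=0}^{\infty} W_m^{(s)} f_{i_m^{(s)}} \tag{5.1.26}$$

and

$$W_m^{(s)} = \frac{a_{i_0^{(s)} i_1^{(s)}}\cdots a_{i_{m-1}^{(s)} i_m^{(s)}}}{P_{i_0^{(s)} i_1^{(s)}}\cdots P_{i_{m-1}^{(s)} i_m^{(s)}}}. \tag{5.1.27}$$

We note that the inner products $\langle h, \sum_{m=0}^{k} A^m f\rangle$ for different h and f can
be found from (5.1.23) by using the same random paths $i_0^{(s)} \to i_1^{(s)} \to \cdots \to i_k^{(s)}$,
s = 1, ..., N, of the M.C. (5.1.10)-(5.1.11).

Remark In the particular case where $A = \alpha P$, $0 < \alpha < 1$, we have $W_m = \alpha^m$
and

$$\eta_k(h) = \frac{h_{i_0}}{p_{i_0}}\sum_{m=0}^{k}\alpha^m f_{i_m}.$$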

5.1.1 Adjoint System of Linear Equations 

Let us define for the system of linear equations (5.1.2) an associated
system of linear equations

$$x^* = A'x^* + h, \tag{5.1.28}$$

where $A' = \|a_{ij}'\|_1^n$ is the transpose of A. It is readily seen that

$$\langle h, x\rangle = \langle x^*, f\rangle. \tag{5.1.29}$$

Indeed, we have from (5.1.2) and (5.1.28) $\langle x^*, x\rangle = \langle x^*, Ax\rangle + \langle x^*, f\rangle$
and $\langle x, x^*\rangle = \langle A'x^*, x\rangle + \langle h, x\rangle$, respectively. Now (5.1.29) follows
because $\langle A'x^*, x\rangle = \langle x^*, Ax\rangle$. We call the pair (5.1.2) and (5.1.28) adjoint
systems.
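The identity (5.1.29) is easy to verify numerically by solving both systems with the plain iteration $x^{(k+1)} = Ax^{(k)} + f$. The matrix and vectors below are illustrative assumptions; for them both inner products equal 2.4.

```python
def iterate(A, f, steps=200):
    """Fixed-point iteration x <- A x + f from x = 0 (converges when (5.1.3) holds)."""
    n = len(f)
    x = [0.0] * n
    for _ in range(steps):
        x = [sum(A[i][j] * x[j] for j in range(n)) + f[i] for i in range(n)]
    return x


def transpose(A):
    return [list(column) for column in zip(*A)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Illustrative data (assumed, not from the text).
A = [[0.1, 0.2], [0.3, 0.1]]
f = [1.0, 2.0]
h = [3.0, -1.0]

x = iterate(A, f)                     # solves x  = A x   + f
x_star = iterate(transpose(A), h)     # solves x* = A' x* + h
print(dot(h, x), dot(x_star, f))
```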

A direct consequence of (5.1.29) is that there exists another unbiased
estimator of $\langle h, x\rangle$, which can be written as

$$\eta_k^*(f) = \frac{f_{i_0}}{p_{i_0}^*}\sum_{m=0}^{k} W_m^* h_{i_m}, \tag{5.1.30}$$

where

$$W_0^* = 1, \quad \text{and} \quad W_m^* = W_{m-1}^*\frac{a_{i_{m-1} i_m}'}{P_{i_{m-1} i_m}^*}$$

are defined on the sample path $i_0 \to i_1 \to \cdots \to i_k$, which is obtained from
the Markov chain defined by the following:

$$P^* = \|P_{ij}^*\|_1^n, \qquad \sum_{i=1}^{n} p_i^* = 1, \qquad \sum_{j=1}^{n} P_{ij}^* = 1, \qquad p_i^* \ge 0, \quad P_{ij}^* \ge 0, \quad i, j = 1, \ldots, n,$$

such that

1 $p_i^* > 0$, if $f_i \ne 0$

2 $P_{ij}^* > 0$, if $a_{ij}' \ne 0$, i, j = 1, ..., n.

In the particular case for which P in (5.1.10)-(5.1.11) is a doubly
stochastic matrix, that is,

$$\sum_{j=1}^{n} P_{ij} = 1 \quad \text{and} \quad \sum_{i=1}^{n} P_{ij} = 1, \tag{5.1.31}$$

P* can be chosen equal to P'. Assuming also A' = A, then together with
(5.1.31) we obtain P* = P, and (5.1.30) becomes

$$\eta_k^*(f) = \frac{f_{i_0}}{p_{i_0}}\sum_{m=0}^{k} W_m h_{i_m}. \tag{5.1.32}$$

Comparing (5.1.14) with (5.1.32), we can see that even in this case, that is, when
A' = A and P* = P, $\eta_k^*(f) \ne \eta_k(h)$. The difference between $\eta_k^*(f)$ and
$\eta_k(h)$ is that the roles of f and h are interchanged.

We return now to the original problem (5.1.2) of estimating all coordinates
$x_j$ of the vector x. In order to estimate the jth coordinate $x_j$ of x we
assume

$$h' = e_j' = (0, \ldots, 0, 1, 0, \ldots, 0),$$

with the 1 in the jth position, and start simulating the M.C. from the state j,
that is, $p_{i_0} = p_j = 1$. The corresponding path is then $j \to i_1 \to i_2 \to \cdots \to i_k$.

Denoting

$$\eta_k(e_j) = \sum_{m=0}^{k} W_m f_{i_m}, \tag{5.1.33}$$

where

$$W_m = \frac{a_{j i_1}a_{i_1 i_2}\cdots a_{i_{m-1} i_m}}{P_{j i_1}P_{i_1 i_2}\cdots P_{i_{m-1} i_m}},$$

we immediately obtain the corollary

$$E[\eta_k(e_j)] = x_j^{(k+1)} \tag{5.1.34}$$

and also

$$\theta_k(e_j) = \frac{1}{N}\sum_{s=1}^{N}\eta_k^{(s)}(e_j) \approx x_j^{(k+1)}. \tag{5.1.35}$$

It follows from (5.1.33) that, in order to estimate all the components $x_j$,
j = 1, 2, ..., n, of the vector x, we have to simulate n random paths
$j \to i_1 \to i_2 \to \cdots \to i_k$, j = 1, 2, ..., n, of the Markov chain (5.1.10)-(5.1.11),
each time starting from a new state $i_0 = j$.

Looking carefully at (5.1.33), we find that all $\eta_k(e_j)$, j = 1, 2, ..., n, are
similar. They differ only in the initial terms $a_{j i_1}/P_{j i_1}$ and $f_j$, which are
associated with the choice of the initial state $i_0$. Thus for $\eta_k(e_j)$ and $\eta_k(e_r)$
we have $a_{j i_1}/P_{j i_1}$, $f_j$ and $a_{r i_1}/P_{r i_1}$, $f_r$, respectively.

We now turn to the question of whether or not all the components $x_j$ of
x can be estimated simultaneously by simulating one path. The answer is
affirmative. We start this topic with the following

Definition The path $i_0 \to i_1 \to \cdots \to i_\tau$ will be called covering if it has
visited each state j = 1, ..., n at least once.

Let $i_0 \to i_1 \to \cdots \to i_t \to \cdots$ be an infinite realization from the Markov
chain (5.1.10)-(5.1.11). Because our Markov chain is ergodic, each state
will be visited infinitely many times and the first hitting time to the state
j, $T_j = \min\{t : i_t = j\}$, is finite almost surely (a.s.). With this result in hand
the procedure for finding all the estimates $\hat\eta_k(e_j)$, j = 1, 2, ..., n, from one
realization can be written in the following way:

1 Simulate a covering path

$$i_0 \to i_1 \to \cdots \to i_{T_j} \to \cdots \to i_T \to \cdots \to i_{T+k}, \tag{5.1.36}$$

where $T = \max_j(T_j) = \max_j \min\{t : i_t = j\}$, j = 1, ..., n, and k is some fixed
number.

2 Find the first hitting time $T_j = \min\{t : i_t = j\}$ for each state j =
1, ..., n, separately.

3 Take the subpath $i_{T_j} \to i_{T_j+1} \to \cdots \to i_{T_j+k}$ (which is part of the
generated path) for each state j = 1, ..., n, separately.

4 Calculate all

$$\hat\eta_k(e_j) = \sum_{m=T_j}^{T_j+k} W_m f_{i_m}, \qquad j = 1, \ldots, n, \tag{5.1.37}$$

where

$$W_m = \frac{a_{i_{T_j} i_{T_j+1}}\cdots a_{i_{m-1} i_m}}{P_{i_{T_j} i_{T_j+1}}\cdots P_{i_{m-1} i_m}}$$

and

$$W_{T_j} = 1 \tag{5.1.38}$$

are defined on the same path (5.1.36) starting at different points $T_j$ associated
with the first hitting time. Each subpath $i_{T_j} \to i_{T_j+1} \to \cdots \to i_{T_j+k}$
is of the same length k. Thus $i_0 \to i_1 \to \cdots \to i_T$ will be a covering path of
minimal length (in a given realization).

5 Simulate N such independent random paths
$i_0^{(s)} \to i_1^{(s)} \to \cdots \to i_{T^{(s)}+k}^{(s)}$, s = 1, ..., N, and find

$$\hat\theta_k(e_j) = \frac{1}{N}\sum_{s=1}^{N}\hat\eta_k^{(s)}(e_j), \qquad j = 1, \ldots, n, \tag{5.1.39}$$

which estimates $x_j$.

Therefore all r.v.'s $\hat\eta_k(e_j)$, j = 1, ..., n, are defined on the same path and
calculated according to the same formula (5.1.37). The only difference
between them is the starting point, which is determined by the first hitting
time $T_j$ and is a random variable.
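The covering-path procedure can be sketched as follows. The data (A, f, a uniform chain P, a uniform choice of the starting state) are illustrative assumptions, and states are numbered from 0. Each simulated path is extended until every state has been hit, then by k further steps, and one estimate per state is read off the subpath that begins at that state's first hitting time.

```python
import random


def covering_path_estimates(A, f, P, k, start, rng):
    """Return one estimate of x_j^(k+1) for every state j from a single covering path."""
    n = len(f)
    path = [start]
    first_hit = {start: 0}
    while len(first_hit) < n:                     # run until the path is covering
        nxt = rng.choices(range(n), weights=P[path[-1]])[0]
        path.append(nxt)
        first_hit.setdefault(nxt, len(path) - 1)  # record T_j at the first visit only
    for _ in range(k):                            # k more steps beyond T = max_j T_j
        path.append(rng.choices(range(n), weights=P[path[-1]])[0])
    estimates = []
    for j in range(n):
        t = first_hit[j]
        weight, total = 1.0, f[j]                 # W_{T_j} = 1
        for m in range(t + 1, t + k + 1):
            a, b = path[m - 1], path[m]
            weight *= A[a][b] / P[a][b]
            total += weight * f[b]
        estimates.append(total)
    return estimates


# Illustrative data (assumed, not from the text).
A = [[0.1, 0.2], [0.3, 0.1]]
f = [1.0, 2.0]
P = [[0.5, 0.5], [0.5, 0.5]]

rng = random.Random(7)
n_paths = 10000
sums = [0.0, 0.0]
for _ in range(n_paths):
    start = rng.choices(range(2), weights=[0.5, 0.5])[0]
    for j, value in enumerate(covering_path_estimates(A, f, P, 20, start, rng)):
        sums[j] += value
theta_hat = [s / n_paths for s in sums]
print(theta_hat)
```

For these data the exact solution is $x = (1.3/0.75,\ 2.1/0.75)$, so the two averages should land near 1.73 and 2.80.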


Proposition 5.1.2

$$E[\hat\eta_k(e_j)] = x_j^{(k+1)}. \tag{5.1.40}$$

Proof The proof of this proposition is based on the strong Markov
property, which is given in Ref. 2, Proposition 1.22, p. 117, and which states
that for any homogeneous Markov chain and any bounded function g
defined on the state space, we have

$$E[g(i_{T_j}, i_{T_j+1}, \ldots) \mid T_j = t] = E[g(i_0, i_1, \ldots) \mid i_0 = j].$$

In our notation

$$E[\hat\eta_k(e_j) \mid T_j = t] = E[g(i_t, i_{t+1}, \ldots, i_{t+k}) \mid i_t = j] = E[g(i_0, \ldots, i_k) \mid i_0 = j].$$

By Proposition 5.1.1 $E[g(i_0, \ldots, i_k) \mid i_0 = j] = x_j^{(k+1)}$. Since $E[\hat\eta_k(e_j) \mid T_j = t]$
does not depend on t, we have $E[\hat\eta_k(e_j) \mid T_j = t] = E[\hat\eta_k(e_j)] = x_j^{(k+1)}$.

Q.E.D.

Corollary $\lim_{k\to\infty} E[\hat\eta_k(e_j)] = x_j$.

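Proposition 5.1.3

$$\operatorname{var}[\hat\theta_k(e_j)] = \operatorname{var}[\theta_k(e_j)]. \tag{5.1.41}$$

Proof

$$\operatorname{var}[\theta_k(e_j)] = \operatorname{var}\left[\frac{1}{N}\sum_{s=1}^{N}\eta_k^{(s)}(e_j)\right]
= \frac{1}{N}\left\{E[\eta_k^2(e_j)] - [x_j^{(k+1)}]^2\right\} = \frac{1}{N}\operatorname{var}\eta_k(e_j).$$

Similarly,

$$\operatorname{var}[\hat\theta_k(e_j)] = \operatorname{var}\left[\frac{1}{N}\sum_{s=1}^{N}\hat\eta_k^{(s)}(e_j)\right]
= \frac{1}{N}\left\{E[\hat\eta_k^2(e_j)] - [x_j^{(k+1)}]^2\right\} = \frac{1}{N}\operatorname{var}\hat\eta_k(e_j).$$

Now again using Proposition 1.22 of Ref. 2 (p. 117), we have

$$E\{[\hat\eta_k(e_j)]^2 \mid T_j = t\} = E[g^2(i_t, \ldots, i_{t+k}) \mid i_t = j] = E[g^2(i_0, \ldots, i_k) \mid i_0 = j].$$

Therefore $\operatorname{var}[\hat\theta_k(e_j)] = \operatorname{var}[\theta_k(e_j)]$.

Q.E.D.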

To compare the efficiencies of the two methods we use (4.2.28), which
can be written

$$\varepsilon = \frac{\hat t\,\operatorname{var}\hat\theta_k(e_j)}{t\,\operatorname{var}\theta_k(e_j)}, \tag{5.1.42}$$

and assume without loss of generality N = 1. Since $\operatorname{var}\hat\theta_k(e_j) = \operatorname{var}\theta_k(e_j)$,
we have $\varepsilon = \hat t/t$. In the first case we have n trajectories each of length k, so
the total length of these trajectories is nk. In the second case we have one
trajectory of length $\max_{j=1,\ldots,n}\{T_j\} + k$, with mean $E(\max_{j=1,\ldots,n}\{T_j\}) + k$.

It is obvious that the second algorithm is on the average more efficient
when n > 1 and $k > (n - 1)^{-1}E[T]$, where $T = \max_{j=1,\ldots,n} T_j$. Because the first
hitting time $T_j$, j = 1, ..., n, to each state is finite a.s., it can be proven that

$$\varepsilon \to \frac{1}{n} \quad \text{a.s. as } k \to \infty, \tag{5.1.43}$$

that is, asymptotically the method of covering path is n times more efficient
than the standard Monte Carlo method.

The efficiency of the second method can be improved if we can find
$i_0 = l$ such that

$$E[\max_j\{T_j\} \mid i_0 = l] = \min_i E[\max_j\{T_j\} \mid i_0 = i]$$

and then take this $i_0 = l$ as a starting point of the path or, equivalently,
choose the initial distribution as

$$p_{i_0} = \begin{cases} 1, & \text{if } i_0 = l \\ 0, & \text{if } i_0 \ne l. \end{cases}$$

5.1.2 Computing the Inverse Matrix

It follows from (5.1.6) that

$$x = B^{-1}f = \sum_{m=0}^{\infty} A^m f,$$

where $B^{-1} = \|b_{jr}^{-1}\|_1^n = I + A + A^2 + \cdots$. The jth coordinate of x is

$$x_j = \sum_{r=1}^{n} b_{jr}^{-1} f_r.$$

Setting

$$f' = e_r' = (0, \ldots, 0, 1, 0, \ldots, 0), \tag{5.1.44}$$

with the 1 in the rth position, we obtain

$$x_j = b_{jr}^{-1} \tag{5.1.45}$$

and the estimator in (5.1.33) becomes

$$\eta_k(e_j) = \sum_{m:\, i_m = r} W_m. \tag{5.1.46}$$

Here the summation with respect to $W_m$ is taken over the indices m with $i_m = r$,
that is, when the particle visits the state r.

The sample mean is then

$$\theta_k = \frac{1}{N}\sum_{s=1}^{N}\ \sum_{m:\, i_m^{(s)} = r} W_m^{(s)}, \tag{5.1.47}$$

where s = 1, 2, ..., N is the path number.

Thus setting

$$h' = e_j' = (0, \ldots, 0, 1, 0, \ldots, 0) \tag{5.1.48}$$

with the 1 in the jth position, and

$$f' = e_r' = (0, \ldots, 0, 1, 0, \ldots, 0),$$

with the 1 in the rth position, we can estimate all the elements $b_{jr}^{-1}$ of the
inverse matrix $B^{-1}$ by (5.1.47).

Inasmuch as the problem of determining $b_{jr}^{-1}$ is a particular case of the
problem of finding $x_j$, we can estimate all the elements $b_{jr}^{-1}$ of the jth row
of the inverse matrix $B^{-1}$ simultaneously with $x_j$. Thus the Monte Carlo
method provides a way of estimating a single element or any collection of
the elements of $B^{-1}$. This desirable feature differentiates the Monte Carlo
method from other numerical methods in which, as a rule, all the elements
of $B^{-1}$ are computed simultaneously.

By solving the adjoint system we can estimate simultaneously all the
elements $b_{jr}^{-1}$ of the rth column of the inverse matrix $B^{-1}$. It follows also
from (5.1.36) through (5.1.39) that all the elements $b_{jr}^{-1}$ can be estimated
simultaneously with the $x_j$'s from the covering path.
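A sketch of the row-wise estimator (5.1.46)-(5.1.47): each path starts in state j, and the running weight $W_m$ is added to the accumulator of whichever state r the particle currently occupies. The matrix A and the uniform chain P are illustrative assumptions; for this A the first row of $B^{-1} = (I - A)^{-1}$ is (1.2, 0.2667).

```python
import random


def inverse_row_sample(A, P, j, k, rng):
    """One path started at j; entry r accumulates W_m over the steps with i_m = r."""
    n = len(A)
    row = [0.0] * n
    state, weight = j, 1.0
    row[state] += weight                      # m = 0 term, W_0 = 1
    for _ in range(k):
        nxt = rng.choices(range(n), weights=P[state])[0]
        weight *= A[state][nxt] / P[state][nxt]
        row[nxt] += weight
        state = nxt
    return row


def estimate_inverse_row(A, P, j, k, n_paths, seed=8):
    """Sample mean over n_paths paths: an estimate of row j of (I - A)^{-1}."""
    rng = random.Random(seed)
    n = len(A)
    sums = [0.0] * n
    for _ in range(n_paths):
        for r, value in enumerate(inverse_row_sample(A, P, j, k, rng)):
            sums[r] += value
    return [s / n_paths for s in sums]


# Illustrative data (assumed, not from the text).
A = [[0.1, 0.2], [0.3, 0.1]]
P = [[0.5, 0.5], [0.5, 0.5]]

row0 = estimate_inverse_row(A, P, j=0, k=20, n_paths=20000)
print(row0)
```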

Before leaving this section we want to turn the reader's attention to the
analogy that exists in calculating integrals and solving systems of linear
equations by Monte Carlo methods.

Calculating the integral

$$I = \int g(x)\,dx,$$

we introduce any p.d.f. $f_X(x)$ such that

$$I = E\left[\frac{g(X)}{f_X(X)}\right],$$

where X is distributed with p.d.f. $f_X(x)$ and $f_X(x) > 0$ when $g(x) \ne 0$. Then
taking a sample of size N from $f_X(x)$, we estimate the integral I by (see (4.3.4))

$$\frac{1}{N}\sum_{i=1}^{N}\frac{g(X_i)}{f_X(X_i)}.$$

While solving the system of linear equations we introduce any ergodic
Markov chain (5.1.10)-(5.1.11). Then simulating our Markov chain, we
obtain the path $j \to i_1 \to \cdots \to i_k$ with probability
$P(i_0, \ldots, i_k) = p_j P_{j i_1}\cdots P_{i_{k-1} i_k}$.

The element $x_j^{(k+1)}$ of the vector $x^{(k+1)}$ can be written (see (5.1.7)) as

$$\begin{aligned}
x_j^{(k+1)} &= f_j + \sum_{i_1} a_{j i_1} f_{i_1} + \sum_{i_1}\sum_{i_2} a_{j i_1}a_{i_1 i_2} f_{i_2} + \cdots \\
&= \sum_{i_1}\cdots\sum_{i_k}\left[f_j + \frac{a_{j i_1}}{P_{j i_1}} f_{i_1} + \frac{a_{j i_1}a_{i_1 i_2}}{P_{j i_1}P_{i_1 i_2}} f_{i_2} + \cdots + \frac{a_{j i_1}\cdots a_{i_{k-1} i_k}}{P_{j i_1}\cdots P_{i_{k-1} i_k}} f_{i_k}\right] P_{j i_1}\cdots P_{i_{k-1} i_k} \\
&= E\left[\sum_{m=0}^{k} W_m f_{i_m}\right] = E[\eta_k(e_j)],
\end{aligned} \tag{5.1.49}$$

where $\eta_k$ is distributed according to $P_{j i_1}\cdots P_{i_{k-1} i_k}$. Here $i_0 = j$ and
$p_j = 1$.

Now considering N random paths $j \to i_1^{(s)} \to \cdots \to i_k^{(s)}$, s =
1, ..., N, we can estimate $x_j^{(k+1)}$ by (5.1.39).

Comparing (4.3.4) and (5.1.35), we realize that both problems of calculating
the integral and solving the system of linear equations can be
reduced to the problem of estimating the expected value of some random
function. In our case the random functions are $g(X)/f_X(X)$ and $\eta_k(e_j)$,
respectively.

These results allow us to suggest a general Monte Carlo procedure for 
solving different problems, which can be written as: 

1 Find a suitable distribution associated with the problem. 

2 Take a sample from this distribution. 

3 Substitute the values from the sample in a proper formula, which 
estimates the solution. 
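The three steps can be illustrated on the integration side of the analogy. The sketch below assumes the integrand $g(x) = e^x$ on [0, 1] and the sampling density $f_X(x) = 2(1 + x)/3$, both chosen only for illustration; the density is sampled by inverting its c.d.f. $(2x + x^2)/3$.

```python
import math
import random


def draw_x(rng):
    """Step 2: inverse-transform draw from f_X(x) = 2(1 + x)/3 on [0, 1]."""
    return -1.0 + math.sqrt(1.0 + 3.0 * rng.random())


def density(x):
    """Step 1: the distribution associated with the problem (an assumed choice)."""
    return 2.0 * (1.0 + x) / 3.0


def estimate_integral(g, n, seed=9):
    """Step 3: substitute the sample into the formula (1/N) sum g(X_i)/f_X(X_i)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = draw_x(rng)
        total += g(x) / density(x)
    return total / n


estimate = estimate_integral(math.exp, 10000)
print(estimate, math.e - 1.0)
```

Because the ratio $g/f_X$ varies only between 1.5 and about 2.04 on [0, 1], the estimate is already close to $e - 1$ for moderate N.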

5.1.3 Solving a System of Linear Equations
by Simulating a Markov Chain with an Absorbing State

Another possibility of estimating $\langle h, x\rangle$ is by simulating a Markov
chain with an absorbing state, as was suggested by Forsythe and Leibler
[4]. Consider the transition matrix

$$P = \begin{pmatrix}
P_{11} & P_{12} & \cdots & P_{1n} & P_{1,n+1} \\
P_{21} & P_{22} & \cdots & P_{2n} & P_{2,n+1} \\
\vdots & \vdots &        & \vdots & \vdots \\
P_{n1} & P_{n2} & \cdots & P_{nn} & P_{n,n+1} \\
0      & 0      & \cdots & 0      & 1
\end{pmatrix} \tag{5.1.50}$$

with

$$P_{ij} \ge 0, \quad i, j = 1, \ldots, n, \qquad P_{i,n+1} = g_i = 1 - \sum_{j=1}^{n} P_{ij} \ge 0,$$

$$\sum_{i=1}^{n} p_i = 1, \qquad p_i \ge 0, \quad i = 1, 2, \ldots, n,$$

which is essentially an augmented (5.1.10) matrix. Here $p_i$ and $P_{ij}$ are,
respectively, the initial and the transition probabilities.

Assume also:

1 $p_i > 0$, if $h_i \ne 0$,

2 $P_{ij} > 0$, if $a_{ij} \ne 0$, i, j = 1, 2, ..., n. (5.1.51)

The state n + 1 is called an absorbing state of the Markov chain (5.1.50)-
(5.1.51). It is well known (Cinlar [2]) that, if there exists a state i,
i = 1, ..., n, such that $P_{i,n+1} > 0$, then all the random paths $i_0 \to i_1 \to \cdots \to i_\nu$
terminate in state n + 1 a.s. and the expected time of termination of
each random path is finite, that is, $E(\nu) < \infty$.

We start to simulate our Markov chain (5.1.50)-(5.1.51) by choosing the
initial state $i_0$ according to the probability $p_{i_0}$, $i_0 = 1, 2, \ldots, n$, where
$\sum_i p_i = 1$. Consider now a particle that is in state $i_0$. The particle either
will be absorbed with probability $g_{i_0}$ in state $i_0$ or will pass to another state
$i_1$ with probability $P_{i_0 i_1}$. Generally, if at time m - 1 the particle arrived at
the state $i_{m-1}$, then it will either be absorbed from there with probability
$g_{i_{m-1}}$ or will continue along the random path to the next state $i_m$ with
probability $P_{i_{m-1} i_m}$. The random path $i_0 \to i_1 \to \cdots \to i_\nu$ has probability

$$p_{i_0}P_{i_0 i_1}P_{i_1 i_2}\cdots P_{i_{\nu-1} i_\nu}\,g_{i_\nu},$$

where $g_i = P_{i,n+1} = 1 - \sum_{j=1}^{n} P_{ij}$ is the probability of absorption from state i.

Consider any r.v. $\eta$, which is defined on the path $i_0 \to i_1 \to \cdots \to i_\nu$.
The expectation of $\eta$ is

$$E(\eta) = \sum_{k=0}^{\infty}\sum_{i_0=1}^{n}\cdots\sum_{i_k=1}^{n}\eta_k\, p_{i_0}P_{i_0 i_1}\cdots P_{i_{k-1} i_k}\,g_{i_k},$$

where $\eta_k$ is defined on the path that terminates exactly after k units of
time.

Let

$$\eta_{(\nu)}(h) = \frac{h_{i_0}}{p_{i_0}}\,W_\nu\,\frac{f_{i_\nu}}{g_{i_\nu}}, \tag{5.1.52}$$

where $W_k$ is the same as in (5.1.12).


Proposition 5.1.4

$$E[\eta_{(\nu)}(h)] = \langle h, x\rangle, \tag{5.1.53}$$

that is, $\eta_{(\nu)}(h)$ is an unbiased estimator of the inner product $\langle h, x\rangle$,
provided $E(\nu) < \infty$.

Proof The r.v. $\eta_{(\nu)}(h)$ takes the value $\eta_k(h)$ with probability
$p_{i_0}P_{i_0 i_1}\cdots P_{i_{k-1} i_k}\,g_{i_k}$. Therefore

$$E[\eta_{(\nu)}(h)] = \sum_{k=0}^{\infty}\sum_{i_0=1}^{n}\cdots\sum_{i_k=1}^{n}\eta_k(h)\,p_{i_0}P_{i_0 i_1}P_{i_1 i_2}\cdots P_{i_{k-1} i_k}\,g_{i_k}. \tag{5.1.54}$$

Substituting (5.1.52) in (5.1.54) and taking (5.1.12) into account, we obtain

$$E[\eta_{(\nu)}(h)] = \sum_{k=0}^{\infty}\sum_{i_0=1}^{n}\cdots\sum_{i_k=1}^{n} h_{i_0}a_{i_0 i_1}\cdots a_{i_{k-1} i_k} f_{i_k}. \tag{5.1.55}$$

Now comparing (5.1.55) with (5.1.19), we immediately obtain (5.1.53).

Q.E.D.


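The procedure for estimating $\langle h, x\rangle$ is:

1 Simulate N independent random paths $i_0^{(s)} \to i_1^{(s)} \to \cdots \to i_{\nu}^{(s)}$,
s = 1, ..., N, from the Markov chain (5.1.50)-(5.1.51).

2 Determine

$$\eta_{(\nu)}^{(s)}(h) = \frac{h_{i_0^{(s)}}}{p_{i_0^{(s)}}}\,W_\nu^{(s)}\,\frac{f_{i_\nu^{(s)}}}{g_{i_\nu^{(s)}}}, \qquad s = 1, \ldots, N,$$

where $W^{(s)}$ is the same as in (5.1.22).

3 Estimate $\langle h, x\rangle$ by

$$\theta = \frac{1}{N}\sum_{s=1}^{N}\eta_{(\nu)}^{(s)}(h).$$

In the particular case where $a_{ij} \ge 0$ and $\max_i \sum_j a_{ij} < 1$, the matrix P in
(5.1.50) can be chosen as

$$P = \begin{pmatrix}
a_{11} & \cdots & a_{1n} & P_{1,n+1} \\
\vdots &        & \vdots & \vdots \\
a_{n1} & \cdots & a_{nn} & P_{n,n+1} \\
0      & \cdots & 0      & 1
\end{pmatrix},$$

that is, $P_{ij} = a_{ij}$, i, j = 1, ..., n. In this case $W_k = 1$ and

$$\eta_k(h) = \frac{h_{i_0}}{p_{i_0}}\,\frac{f_{i_k}}{g_{i_k}}.$$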


There are, however, few applications of these techniques. The reason is 
that the Monte Carlo method is not competitive with classical numerical 
analysis in solving systems of linear equations. Still, there are some 
situations where the Monte Carlo method can be successfully used: 

1 The size of the matrix $A = \|a_{ij}\|_1^n$ is sufficiently large ($n > 10^3$), and
a very rough approximation is required.

2 It is necessary to find $\langle h, x\rangle$ for different h and f, where x = Ax + f.

As mentioned above, such problems can be solved (estimated) simulta- 
neously by simulating only one Markov chain. 
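A sketch of the absorbing-chain estimator, written for a general sub-stochastic P and then applied to the particular case $P_{ij} = a_{ij}$ in which the weight stays equal to 1. The data are illustrative assumptions, states are numbered from 0, and index n plays the role of the absorbing state n + 1. For these data the exact value is $\langle h, x\rangle = 3.4/0.75 \approx 4.533$.

```python
import random


def eta_absorbing(A, f, h, p, P, rng):
    """One realization of (h_{i0}/p_{i0}) * W * f_i / g_i at the state i of absorption."""
    n = len(f)
    g = [1.0 - sum(row) for row in P]            # absorption probabilities g_i
    i0 = rng.choices(range(n), weights=p)[0]
    state, weight = i0, 1.0
    while True:
        # Index n stands for the absorbing state; it is chosen with probability g[state].
        nxt = rng.choices(range(n + 1), weights=P[state] + [g[state]])[0]
        if nxt == n:
            return h[i0] / p[i0] * weight * f[state] / g[state]
        weight *= A[state][nxt] / P[state][nxt]
        state = nxt


def estimate_absorbing(A, f, h, p, P, n_paths, seed=10):
    """Sample mean of the absorbing-chain estimator over n_paths paths."""
    rng = random.Random(seed)
    return sum(eta_absorbing(A, f, h, p, P, rng) for _ in range(n_paths)) / n_paths


# Illustrative data (assumed): a_ij >= 0 with row sums below 1, so P = A is admissible.
A = [[0.1, 0.2], [0.3, 0.1]]
f = [1.0, 2.0]
h = [1.0, 1.0]
p = [0.5, 0.5]

theta = estimate_absorbing(A, f, h, p, A, n_paths=20000)
print(theta)
```

Unlike the fixed-length estimator, no truncation index k has to be chosen here; the path length is random and is determined by the absorption probabilities.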
5.2 INTEGRAL EQUATIONS


One of the most fruitful applications of Monte Carlo methods is in 
solution of integral equations. The reason is that such equations cannot be 
solved efficiently by classical numerical analysis. 

The idea of solving integral equations by a Monte Carlo method is 
similar to that of solving simultaneous linear equations. Both methods use 
Markov chains for simulation. 

There exists ample literature on solving integral equations by Monte
Carlo methods (see [3, 7-9]). Its history is connected with the problem of
neutron transport, which is described in Spanier and Gelbard's monograph
[9]. One of the earliest methods for solving integral equations by a Monte
Carlo method was proposed by Albert [1] and was later developed in Refs.
3, 7, and 8.

Before proceeding with this topic we need some background on integral 
transforms. 


5.2.1 Integral Transforms 

Throughout this section we follow Sobol [8]. Let $K$ be an integral
operator such that

\[ K\psi(x) = \int_D K(x, x_1)\,\psi(x_1)\,dx_1, \qquad x \in D, \tag{5.2.1} \]

which maps the function $\psi(x)$ into $K\psi(x)$. $K\psi(x)$ is usually called the first
iteration of $\psi$ with respect to the kernel $K$.

The second iteration is

\[ K[K\psi](x) = K^2\psi(x) = \int\!\!\int K(x, x_1)\,K(x_1, x_2)\,\psi(x_2)\,dx_1\,dx_2. \tag{5.2.2} \]

Proceeding recursively we obtain

\[ K[K^{k-1}\psi](x) = K^k\psi(x) = \int K(x, x_k)\,K^{k-1}\psi(x_k)\,dx_k, \tag{5.2.3} \]

the $k$th iteration of $\psi$ with respect to the kernel $K$.

We can estimate such integrals by quadrature methods or by Monte 
Carlo methods, as described in Chapter 4. However, there exists another 
Monte Carlo method of estimating such integrals, a method that is similar 
to the method of solving systems of simultaneous linear equations and that
is based on simulating a Markov chain.

Before describing the method let us introduce some notations and make 
some assumptions. 
For any two functions $h(x)$ and $\psi(x)$ their inner product is denoted by
$\langle h, \psi\rangle$, where

\[ \langle h, \psi\rangle = \int h(x)\,\psi(x)\,dx. \tag{5.2.4} \]

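Assume also that

\[ \psi(x) \in L_2(D), \tag{5.2.5} \]

\[ h(x) \in L_2(D), \tag{5.2.6} \]

and

\[ K(x, y) \in L_2(D \times D), \tag{5.2.7} \]

which is the same as

\[ \int h^2\,dx < \infty, \tag{5.2.8} \]

\[ \int \psi^2\,dx < \infty, \tag{5.2.9} \]

and

\[ \int\!\!\int K^2\,dx\,dy < \infty, \tag{5.2.10} \]

respectively.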

It is easy to prove, using the Cauchy-Schwarz inequality, that, if
conditions (5.2.8) and (5.2.9) are met, then $|\langle h, \psi\rangle| < \infty$. Indeed

\[ |\langle h, \psi\rangle| \le \left[\int h^2\,dx \int \psi^2\,dx\right]^{1/2} < \infty. \tag{5.2.11} \]

In exercise 2 the reader is asked to prove $K\psi(x) \in L_2(D)$, given (5.2.5) and
(5.2.7).

With these results we can return to our problem of evaluating $K^k\psi$. As
we mentioned before, the method of evaluating $K^k\psi$ is similar to those for
solving the system of linear equations described in Section 5.1.1. From
now on we consider the problem of finding the inner product $\langle h, K^k\psi\rangle$,
which is similar to the inner-product problem treated there. The reader is
asked to keep this similarity in mind.

By analogy with (5.1.10) and (5.1.11) let us introduce any continuous
Markov chain

\[ x_0 \to x_1 \to \cdots \to x_k \to \cdots, \qquad x_i \in D, \tag{5.2.12} \]

satisfying $\int P(x, y)\,dy = 1$, $\int p(x)\,dx = 1$, such that

1 $p(x) > 0$, if $h(x) \ne 0$

2 $P(x, y) > 0$, if $K(x, y) \ne 0$, (5.2.13)

where $p(x)$ and $P(x, y)$ are, respectively, the initial and the transition
densities of the Markov chain (5.2.12)-(5.2.13).

By analogy with Proposition 5.1.1 we can readily prove the following


Proposition 5.2.1

\[ E[\eta_k(h)] = \langle h, K^k\psi\rangle, \tag{5.2.14} \]

where the r.v.

\[ \eta_k(h) = \frac{h(x_0)}{p(x_0)}\,W_k\,\psi(x_k) \tag{5.2.15} \]

is defined on the path $x_0 \to x_1 \to \cdots \to x_k$, which has density
$p(x_0)P(x_0, x_1)P(x_1, x_2)\cdots P(x_{k-1}, x_k)$, and

\[ W_k = W_{k-1}\,\frac{K(x_{k-1}, x_k)}{P(x_{k-1}, x_k)}, \qquad W_0 = 1. \tag{5.2.16} \]


Assuming for some given $y$ that $h(x) = p(x) = \delta(x - y)$, where $\delta(x)$ is
Dirac's delta function, we immediately obtain $E[\eta_k(h)] = K^k\psi(y)$.

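The procedure for estimating the inner product $\langle h, K^k\psi\rangle$, where $K^k\psi$ is
defined in (5.2.3), can be written by analogy with the procedure for
estimating $\langle h, x^{(k+1)}\rangle$ in Section 5.1.1 as follows:


1 Choose any integer $k > 0$.

2 Simulate $N$ independent random paths $x_0^{(s)} \to x_1^{(s)} \to \cdots \to x_k^{(s)}$,
$s = 1, 2, \ldots, N$, from Markov chain (5.2.12)-(5.2.13).

3 Find

\[ \eta_k^{(s)}(h) = \frac{h(x_0^{(s)})}{p(x_0^{(s)})}\,W_k^{(s)}\,\psi(x_k^{(s)}), \qquad s = 1, \ldots, N, \tag{5.2.17} \]

where

\[ W_k^{(s)} = \prod_{m=1}^{k} \frac{K(x_{m-1}^{(s)}, x_m^{(s)})}{P(x_{m-1}^{(s)}, x_m^{(s)})}. \tag{5.2.18} \]

4 Calculate

\[ \bar\eta_k(h) = \frac{1}{N}\sum_{s=1}^{N} \eta_k^{(s)}(h), \tag{5.2.19} \]

which is an unbiased estimator of the inner product $\langle h, K^k\psi\rangle$.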
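
The four steps above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the text: the region is taken to be $D = [0, 1]$, both the initial density $p$ and the transition density $P$ are uniform (so every factor $K/P$ reduces to $K$), and the test functions $K(x, y) = xy$, $\psi(y) = y$, $h \equiv 1$ are chosen because $\langle h, K^2\psi\rangle = 1/18$ can be checked by hand. The function name `estimate_inner_product` is hypothetical.

```python
import random

def estimate_inner_product(h, psi, K, k, n_paths, seed=0):
    """Estimate <h, K^k psi> on D = [0, 1] by simulating a Markov chain.

    Assumption of this sketch: p(x) = 1 and P(x, y) = 1 on [0, 1],
    so dividing by p or P is a division by 1.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        x = rng.random()            # x_0 drawn from the initial density p
        eta = h(x) / 1.0            # h(x_0) / p(x_0)
        for _ in range(k):
            y = rng.random()        # next state drawn from P(x, .)
            eta *= K(x, y) / 1.0    # W_m = W_{m-1} * K(x_{m-1}, x_m) / P(x_{m-1}, x_m)
            x = y
        total += eta * psi(x)       # eta_k(h) = (h/p) * W_k * psi(x_k)
    return total / n_paths          # sample mean over the N paths

# Illustrative (assumed) test problem with a hand-computable answer of 1/18.
estimate = estimate_inner_product(
    h=lambda x: 1.0,
    psi=lambda x: x,
    K=lambda x, y: x * y,
    k=2,
    n_paths=200_000,
)
```

A non-uniform $p$ or $P$ would only change how `x` and `y` are drawn and the two divisors; conditions (5.2.13) must still hold for the estimator to remain unbiased.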

5.2.2 Integral Equations of the Second Kind

Consider the following integral equation of the second kind:

\[ z(x) = \int_D K(x, x_1)\,z(x_1)\,dx_1 + f(x), \tag{5.2.20} \]

which can be written as

\[ z = Kz + f. \tag{5.2.21} \]

Let us assume that $f(x) \in L_2(D)$, $K(x, x_1) \in L_2(D \times D)$, and

\[ \|K\| = \sup_{x \in D} \int_D |K(x, y)|\,dy < 1. \tag{5.2.22} \]

Under these assumptions by analogy with (5.1.4) we can estimate (5.2.20),
applying the following recursive equation:

\[ z^{(k+1)} = Kz^{(k)} + f. \tag{5.2.23} \]

Setting $z^{(0)} \equiv 0$ and $K^0 \equiv I$ (the identity operator), we get

\[ z^{(k+1)} = f + Kf + \cdots + K^k f = \sum_{m=0}^{k} K^m f. \tag{5.2.24} \]

Taking the limit

\[ \lim_{k\to\infty} z^{(k)} = \lim_{k\to\infty} \sum_{m=0}^{k} K^m f, \]

we obtain the exact solution $z$, provided the von Neumann series
converges.

One way of estimating $\langle h, z\rangle$ is via simulation of a continuous Markov
chain similar to (5.1.10)-(5.1.11).

The following proposition can be readily proved [8] by analogy with 
Proposition 5.1.1. 


Proposition 5.2.2 For any given function $h$

\[ E[\eta_k(h)] = \left\langle h, \sum_{m=0}^{k} K^m f \right\rangle, \tag{5.2.25} \]

where the r.v. $\eta_k(h)$ is defined on the path $x_0 \to x_1 \to \cdots \to x_k$, such that

\[ \eta_k(h) = \frac{h(x_0)}{p(x_0)} \sum_{m=0}^{k} W_m f(x_m), \tag{5.2.26} \]

\[ W_m = \frac{K(x_0, x_1)K(x_1, x_2)\cdots K(x_{m-1}, x_m)}{P(x_0, x_1)P(x_1, x_2)\cdots P(x_{m-1}, x_m)}. \tag{5.2.27} \]

The sample mean

\[ \frac{1}{N}\sum_{s=1}^{N} \eta_k^{(s)}(h) \approx \left\langle h, \sum_{m=0}^{k} K^m f \right\rangle \tag{5.2.28} \]

estimates the inner product $\langle h, \sum_{m=0}^{k} K^m f\rangle$.

Assuming again $h(x) = p(x) = \delta(x - y)$, we obtain (5.2.23). Considering
an infinite path $x_0 \to x_1 \to \cdots \to x_k \to \cdots$, we define the random variable

\[ \eta_\infty(h) = \frac{h(x_0)}{p(x_0)} \sum_{m=0}^{\infty} W_m f(x_m). \tag{5.2.29} \]


It can be shown that for $\eta_\infty(h)$ to be an unbiased estimator of $\langle h, z\rangle$,
that is,

\[ \lim_{k\to\infty} E[\eta_k(h)] = E[\eta_\infty(h)] = \langle h, z\rangle, \tag{5.2.30} \]

it is not enough to assume the convergence of the von Neumann series
$\sum_{m=0}^{\infty} K^m f$, that is,

\[ \sum_{m=0}^{\infty} K^m f < \infty. \tag{5.2.31} \]

The reader is asked in exercise 5 to prove (5.2.30), provided

\[ \sum_{m=0}^{\infty} |K^m f| < \infty. \tag{5.2.32} \]

It is obvious that, when $K(x, y) \ge 0$ and $f(x) \ge 0$, conditions (5.2.31) and
(5.2.32) coincide.

Another way of estimating $\langle h, z\rangle$ is via simulation of a continuous
Markov chain with an absorbing state similar to that of (5.1.50)-(5.1.51).
Consider the random path $x_0 \to x_1 \to \cdots \to x_k$ with the absorption time $k$,
which is a random variable such that $E(k) < \infty$. Define on this path the
r.v. (compare with (5.1.52))

\[ \eta_{(k)}(h) = \frac{h(x_0)}{p(x_0)}\,W_k\,\frac{f(x_k)}{g(x_k)}, \tag{5.2.33} \]

where $g(x)$ is the absorption probability, $p(x)$ is the initial distribution,
and

\[ W_k = \frac{K(x_0, x_1)K(x_1, x_2)\cdots K(x_{k-1}, x_k)}{P(x_0, x_1)P(x_1, x_2)\cdots P(x_{k-1}, x_k)}. \tag{5.2.34} \]

Then by analogy with Proposition 5.1.4 we can readily prove

\[ E[\eta_{(k)}(h)] = \langle h, z\rangle, \tag{5.2.35} \]

provided $\sum_{m=0}^{\infty} |K^m f| < \infty$.
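
A minimal Python sketch of the absorbing-chain estimator (5.2.33) follows. Everything specific in it is an assumption made for illustration: $D = [0, 1]$, a uniform initial density, a constant absorption probability $g$, and a transition sub-density $P(x, y) = 1 - g$ on $[0, 1]$. The test problem $K(x, y) = xy$, $f(x) = x$, $h \equiv 1$ is chosen because the exact solution $z(x) = 1.5x$, and hence $\langle h, z\rangle = 0.75$, is easy to verify by substitution into (5.2.20). The name `absorbing_estimate` is hypothetical.

```python
import random

def absorbing_estimate(h, f, K, g, n_paths, seed=0):
    """Estimate <h, z> for z = Kz + f on D = [0, 1] with an absorbing chain.

    Assumptions of this sketch: p(x) = 1 on [0, 1]; at every state the
    chain is absorbed with constant probability g, otherwise it jumps to
    a uniform point, i.e. P(x, y) = 1 - g.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        x = rng.random()                    # x_0 drawn from p
        weight = h(x)                       # h(x_0) / p(x_0), with p = 1
        while rng.random() >= g:            # chain survives with probability 1 - g
            y = rng.random()                # next state
            weight *= K(x, y) / (1.0 - g)   # multiply by K / P, building W_k
            x = y
        total += weight * f(x) / g          # (h/p) * W_k * f(x_k) / g(x_k)
    return total / n_paths

# Assumed test problem: exact value <h, z> = 0.75.
estimate = absorbing_estimate(
    h=lambda x: 1.0,
    f=lambda x: x,
    K=lambda x, y: x * y,
    g=0.5,
    n_paths=200_000,
)
```

Here $\|K\| = \sup_x \int_0^1 xy\,dy = 1/2 < 1$, so assumption (5.2.22) is satisfied for this test kernel.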

To estimate $\langle h, z\rangle$ we simulate $N$ random paths $x_0^{(s)} \to x_1^{(s)} \to \cdots \to x_{k_s}^{(s)}$,
$s = 1, \ldots, N$, with absorption state and find

\[ \frac{1}{N}\sum_{s=1}^{N} \eta_{(k)}^{(s)}(h) \approx \langle h, z\rangle. \tag{5.2.36} \]

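The problem $x = Ax + f$ can be considered as a particular case of the
problem $z = Kz + f$. Indeed, let us partition the region $D$ into $n$ mutually
disjoint subregions $D_j$, $j = 1, 2, \ldots, n$, such that $\bigcup_{j=1}^{n} D_j = D$, and let us
assume that $f(x)$ and $K(x, x_1)$ are constant functions in each subregion $D_i$,
$i = 1, 2, \ldots, n$, that is,

\[ f(x) = f_i, \quad x \in D_i; \qquad K(x, x_1) = a_{ij}, \quad x \in D_i,\ x_1 \in D_j. \tag{5.2.37} \]

Then, for any $x \in D_i$,

\[ z(x) = \sum_{j=1}^{n} \int_{D_j} K(x, x_1)\,z(x_1)\,dx_1 + f_i = \sum_{j=1}^{n} a_{ij} \int_{D_j} z(x_1)\,dx_1 + f_i. \tag{5.2.38} \]

Inasmuch as $z(x)$ does not depend on $x$ within each subregion, say $z(x) = z_i$
for $x \in D_i$, the last formula can be written as

\[ z_i = \sum_{j=1}^{n} a_{ij}\,|D_j|\,z_j + f_i, \qquad i = 1, \ldots, n, \tag{5.2.39} \]

where $|D_j|$ denotes the volume of $D_j$.

Thus by partitioning the region $D$ into $n$ disjoint subregions, we can find
the solution of the integral equation (5.2.20) by solving the system of linear
equations (5.2.39).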

5.2.3 Eigenvalue Problem

Consider the following homogeneous integral equation:

\[ z(x) = \lambda \int K(x, x_1)\,z(x_1)\,dx_1, \tag{5.2.40} \]

which can be written as

\[ z = \lambda Kz. \tag{5.2.41} \]

If $z \not\equiv 0$, then $\lambda$ is called the eigenvalue and $z(x)$ is called the eigenfunction
corresponding to $\lambda$.

Let us assume that the smallest eigenvalue $\lambda_1$ is positive and that the
kernel $K(x, y) = K(y, x)$ is symmetric and positive definite, that is, $\langle K\psi, \psi\rangle > 0$
if $\psi \not\equiv 0$. Under these assumptions for any two positive functions $f$ and
$h$ we have (see Sobol [8])

\[ \lim_{m\to\infty} \frac{\langle h, K^m f\rangle}{\langle h, K^{m+1} f\rangle} = \lambda_1 \tag{5.2.42} \]

and

\[ \lim_{m\to\infty} K^m f(x)\,\langle K^m f, K^m f\rangle^{-1/2} = z(x), \tag{5.2.43} \]

where $z(x)$ is the eigenfunction corresponding to $\lambda_1$.

We can estimate $\langle h, K^m f\rangle$ and $K^m f$ simultaneously by a Monte Carlo
method as described in Section 5.2.1.
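
As a hand-checkable illustration of the ratio in (5.2.42), consider the kernel $K(x, y) = xy$ on $D = [0, 1]$ with $f = h \equiv 1$. This example is an assumption chosen for convenience and is not taken from the text; the kernel has rank one and is only positive semidefinite, so it illustrates the mechanics of the limit rather than the full hypotheses stated above. Its single eigenfunction is $z(x) = x$ with eigenvalue $\lambda = 3$, since $Kz(x) = x\int_0^1 y^2\,dy = x/3$.

```latex
\begin{align*}
Kf(x) &= \int_0^1 x y \, dy = \frac{x}{2}, \qquad
K^{m} f(x) = \frac{x}{2 \cdot 3^{\,m-1}} \quad (m \ge 1), \\
\langle h, K^{m} f \rangle &= \int_0^1 \frac{x}{2 \cdot 3^{\,m-1}} \, dx
  = \frac{1}{4 \cdot 3^{\,m-1}}, \\
\frac{\langle h, K^{m} f \rangle}{\langle h, K^{m+1} f \rangle} &= 3 \quad (m \ge 1), \\
K^{m} f(x) \, \langle K^{m} f, K^{m} f \rangle^{-1/2} &= \sqrt{3}\, x .
\end{align*}
```

In this degenerate case the ratio equals the eigenvalue for every $m \ge 1$, and the last line is the eigenfunction normalized to unit $L_2$ norm; for a general kernel both quantities are reached only in the limit, and the inner products would be replaced by their Monte Carlo estimates.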

For further discussion of eigenvalue problems we refer the reader to 
Hammersley and Handscomb [5] and Sobol [8]. 

Until now we have not made any special assumption about our Markov
chains. We have required only that the estimators $\eta_k(h)$ and $\eta_{(k)}(h)$ be
unbiased. It is clear that the variance of both $\eta_k(h)$ and $\eta_{(k)}(h)$ depends on
the transition probabilities $P_{ij}$. Since in solving linear and integral equations
we have, respectively, sums and integrals to deal with, it should be
possible to use some of the variance reduction techniques of Chapter 4 for
better efficiency. In this context the reader is referred to Michailov [7] and
Ermakov [3].


5.3 THE DIRICHLET PROBLEM

One of the earliest and most popular illustrations of the Monte Carlo
method is the solution of Dirichlet's problem [4].

Dirichlet's problem is to find a continuous and differentiable function $u$
over a given domain $D$ with boundary $D^0$, satisfying

\[ \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = F(x, y), \qquad (x, y) \in D \tag{5.3.1} \]

and

\[ u(x, y) = g(x, y), \qquad \text{for } (x, y) \in D^0, \tag{5.3.2} \]

where $g = g(x, y)$ is some prescribed function.

Equation (5.3.1) with $F(x, y) \equiv 0$ is called the Laplace equation; with
$F(x, y) \not\equiv 0$ it is known as the Poisson equation.

Generally, there is no analytical solution known to this problem and we
have to apply a numerical method. We usually start by covering $D$ with a
grid and replacing the differential equation by its finite-difference
approximation. Let us denote the closure of $D$ by $\bar D$, that is, $D \cup D^0 = \bar D$, and
the coordinates of the grid by $x_\alpha = \alpha h$ and $y_\beta = \beta h$, where $h$ is the step size.
Taking the two-dimensional case for convenience, we call the point $(x_\alpha, y_\beta) \in \bar D$
an interior point of $D$ if four neighbor points of $(x_\alpha, y_\beta)$, namely,
$(x_\alpha - h, y_\beta)$, $(x_\alpha + h, y_\beta)$, $(x_\alpha, y_\beta - h)$, and $(x_\alpha, y_\beta + h)$, also belong to $\bar D$.

We call $(x_\alpha, y_\beta) \in \bar D$ a boundary point if there are not four neighbor
points that belong to $\bar D$.

Taking this definition into account, we have for any interior point

\[ \frac{u_{\alpha+1,\beta} - 2u_{\alpha\beta} + u_{\alpha-1,\beta}}{h^2} + \frac{u_{\alpha,\beta+1} - 2u_{\alpha\beta} + u_{\alpha,\beta-1}}{h^2} = F_{\alpha\beta}, \qquad (x_\alpha, y_\beta) \in D, \tag{5.3.3} \]

which is the finite-difference equation of (5.3.1). Here $u_{\alpha\beta} = u(x_\alpha, y_\beta)$,
$F_{\alpha\beta} = F(x_\alpha, y_\beta)$, $u_{\alpha\pm1,\beta} = u(x_\alpha \pm h, y_\beta)$, and $u_{\alpha,\beta\pm1} = u(x_\alpha, y_\beta \pm h)$. The last
equation can be rewritten as

\[ u_{\alpha\beta} = \tfrac{1}{4}\left(u_{\alpha-1,\beta} + u_{\alpha+1,\beta} + u_{\alpha,\beta-1} + u_{\alpha,\beta+1} - h^2 F_{\alpha\beta}\right). \tag{5.3.4} \]

The boundary condition (5.3.2) is then

\[ u_{\alpha\beta} = g_{\alpha\beta}, \qquad (x_\alpha, y_\beta) \in D^0. \tag{5.3.5} \]

It is not difficult to see that by numbering all the points $(x_\alpha, y_\beta) \in \bar D$ in any
order we can rewrite (5.3.4) and (5.3.5) as

\[ u_i = \sum_{j=1}^{n} a_{ij} u_j + f_i, \qquad i = 1, \ldots, n. \tag{5.3.6} \]

Here $n$ is the number of mesh points $(x_\alpha, y_\beta) \in \bar D$, which is also equal to
the order of the matrix $\|a_{ij}\|_1^n$.

The matrix $\|a_{ij}\|$ has a specific structure: all diagonal elements are
equal to zero; each row corresponding to an interior point of $D$ has four
elements equal to $\tfrac14$, all other elements being zero; each row corresponding
to a boundary point of $D^0$ also contains elements equal to $\tfrac14$ or zero, but
the number of $\tfrac14$ elements equals the number of neighboring points, which is
always less than 4.

Thus the Dirichlet problem is approximated by a system of linear
equations (5.3.6), which can be solved by the Monte Carlo methods
described in Section 5.1.3.
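
For the Laplace case $F \equiv 0$, equation (5.3.4) says that $u$ at an interior node is the average of its four neighbors, which suggests the classical random-walk estimate sketched below. This is an illustration under assumptions, not necessarily the exact procedure of Section 5.1.3: the domain is taken to be a square grid with nodes $(i, j)$, $0 \le i, j \le n$, the walk moves to each neighbor with probability $\tfrac14$ and is absorbed at the first boundary node it reaches, and the boundary function is passed in grid units. The name `walk_estimate` is hypothetical.

```python
import random

def walk_estimate(i0, j0, n, g, n_walks, seed=0):
    """Estimate u at node (i0, j0) of an (n+1) x (n+1) square grid, F = 0.

    Each walk steps to one of the four neighbors with probability 1/4 and
    stops at the first boundary node; the estimate is the average of the
    boundary values g at the exit nodes.
    """
    rng = random.Random(seed)
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    total = 0.0
    for _ in range(n_walks):
        i, j = i0, j0
        while 0 < i < n and 0 < j < n:     # still at an interior node
            di, dj = rng.choice(steps)     # pick a neighbor uniformly
            i, j = i + di, j + dj
        total += g(i, j)                   # absorbed: record boundary value
    return total / n_walks

# Assumed check: g(i, j) = i is linear, so the discrete solution is u(i, j) = i.
u_35 = walk_estimate(3, 5, 10, lambda i, j: float(i), n_walks=20_000)
```

Because a linear function satisfies the five-point averaging relation exactly, the estimate at node $(3, 5)$ should be close to 3, which gives a simple sanity check of the walk.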


EXERCISES 

1 Describe an algorithm for simulating an ergodic Markov chain. 

2 Prove that $K\psi(x) \in L_2(D)$, given (5.2.5) and (5.2.7).

3 Prove Proposition 5.2.1, that is, $E[\eta_k(h)] = \langle h, K^k\psi\rangle$.

4 Let X'V(x)* where 

. . M , if k 

<s*kr+j>= ( Q i fk+j. 

Prove that 

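5 Prove (5.2.30), given $\sum_{m=0}^{\infty} |K^m f| < \infty$.

6 Prove (5.2.35), given $\sum_{m=0}^{\infty} |K^m f| < \infty$.

7 Consider the recursive formula (5.2.23)

\[ z^{(k+1)} = Kz^{(k)} + f. \]

Assume $z^{(0)} = \varphi(x)$, where $\varphi(x)$ is any function. Then

\[ z^{(k+1)} = \sum_{m=0}^{k} K^m f + K^{k+1}\varphi. \]

Define

\[ \eta_k(h) = \frac{h(x_0)}{p(x_0)} \left[\sum_{m=0}^{k} W_m f(x_m) + W_{k+1}\,\varphi(x_{k+1})\right] \]

and prove that

\[ E[\eta_k(h)] = \langle h, z^{(k+1)}\rangle. \]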

8 Prove (5.1.43), that is, prove that asymptotically the method of covering path is n 
times more efficient than the standard Monte Carlo method. 

9 Consider the systems of linear equations x = Ax +/, where 

H«". SHE El and 

The exact solution of this system is $x = (x_1, x_2) = (7.5, 8.75)$.

Simulate the following Markov chain with an absorbing state:

\[ P = \begin{pmatrix} 0.5 & 0.2 & 0.3 \\ 0.3 & 0.4 & 0.3 \\ 0 & 0 & 1 \end{pmatrix} \]

and estimate the exact solution $x = (x_1, x_2) = (7.5, 8.75)$ by making a run of
1000 replications of the Markov chain.

CHAPTER 6

Regenerative Method for
Simulation Analysis


6.1 INTRODUCTION

It has already been mentioned in Chapter 1 that many real-world problems 
are too complex to be solved by analytical methods and that the most 
practical approach to their study is through simulation. In this chapter we 
consider simulation of a stochastic system, that is, of a system with random 
elements. Simulation of such systems can be considered as a statistical 
experiment, in which we seek valid statistical inferences about some 
unknown parameters associated with the output of the system (or the 
associated model) being simulated. However, classical methods of statistics 
are often unsuitable for estimating these parameters. The reason, as we see 
later, is that the observations made on the simulated system are highly 
correlated and nonstationary in time; under these circumstances it is 
difficult (actually impossible) to carry out adequate statistical analyses of 
the simulated data. To overcome these difficulties a procedure based on 
regenerative phenomena, called the regenerative method, has recently been 
developed. 

Historically, Cox and Smith [4] were the first to suggest use of regenera- 
tive phenomena for simulating a queueing system with Poisson arrivals. 
This idea was extended by Kabak [39] and Poliak [59]. Quite recently, 
Crane and Iglehart [6-9] developed a methodology for the regenerative 
method, based on a unified approach to analyze the output of simulations 
of those systems that have the property of self-regeneration, that is, of 
invariably returning (at particular times) to the conditions under which the 
future of the simulation becomes a probabilistic replica of its past. In other 
words, if the simulation output is viewed as a stochastic process, the
regenerative property means that at those particular times the future
behavior of the process is independent of its past behavior, and is 
governed by the same probability law, that is, at those times the stochastic 
process “starts afresh probabilistically.” Crane and Iglehart showed that a 
wide variety of problems, such as communication networks, queues, main- 
tenance and inventory control systems, can be cast into a common 
framework using regenerative phenomena; they then proposed a simple 
technique for obtaining point estimators and confidence intervals for 
parameters associated with the simulation output. 

The regenerative method also provides answers to the following im- 
portant problems: how and when to start the simulation, how long to run 
it, when to begin collecting data, and how to deal with highly correlated 
data. 

The theory and practice of the regenerative method are now in the 
process of rapid development. The list of references contains about 100 
relevant papers known to the author. An excellent introduction to the 
regenerative method can be found in Crane and Lemoine’s book [10]. 
Iglehart's forthcoming monograph [38] will present a rigorous development 
of both the theory and practice. Many other recently obtained results, in
particular regarding simulation of response time in networks of queues, are
to be found in Iglehart and Shedler's monograph [37].

This chapter is organized as follows. The basic ideas of the regenerative 
method are discussed in Section 6.2. Section 6.3 deals with statistical 
problems, in particular with the confidence interval for the expected values 
of some functions defined on the steady-state distribution of the process
being simulated. In Section 6.4 the ideas of the regenerative method are 
illustrated for a single-server queue, a repairman system, and a closed 
queueing system. Choice of the best among a set of competing systems is 
the subject of Section 6.5. Section 6.6 deals with a linear programming
problem in which the coefficients are unknown and represent the output
parameters of regenerative processes. Variance reduction techniques in
regenerative simulation are the subject of Section 6.7. 


6.2 REGENERATIVE SIMULATION

We start this section with the definition of a regenerative process. Roughly
speaking, a stochastic process $\{X(t) : t \ge 0\}$ is called regenerative if there
exist certain random times $0 \le T_0 < T_1 < T_2 < \cdots$ forming a renewal
process* such that at each such time the future of the process becomes a

*A sequence of random variables $\{T_n : n \ge 0\}$ is a renewal process provided that $T_0 = 0$ and
$T_n - T_{n-1}$ $(n \ge 1)$ are i.i.d. r.v.'s.
probabilistic replica of the process itself. Informally, this means that at
these times the future behavior of the process is independent of its past
behavior and is invariably governed by the same law. In other words, the
part of the process $\{X(t) : T_{i-1} \le t < T_i\}$ defined between any pair of
successive times is a statistically independent probabilistic replica of any
other part of the same process defined between any other pair of successive
times.

The times $\{T_i : i \ge 0\}$ are called regeneration times and the time between
$T_{i-1}$ and $T_i$ is referred to as the length of the $i$th cycle. Formally [5], a
stochastic process $\{X(t) : t \ge 0\}$ is regenerative if there exists a sequence
$T_0, T_1, \ldots$ of stopping times* such that:

1 $T = \{T_i : i = 0, 1, \ldots\}$ is a renewal process.

2 For any $l, m \in \{0, 1, \ldots\}$ and $t_1, \ldots, t_l \ge 0$, the random vectors
$\{X(t_1), \ldots, X(t_l)\}$ and $\{X(T_m + t_1), \ldots, X(T_m + t_l)\}$ are identically
distributed and the processes $\{X(t) : t < T_m\}$ and $\{X(T_m + t) : t \ge 0\}$ are
independent.

For example, let $\{X_n : n \ge 0\}$ be an irreducible, aperiodic, and positive
recurrent Markov chain with a countable state space $I = \{0, 1, \ldots\}$, and let
$j$ be a fixed state; then every time at which state $j$ is entered is a time of
regeneration.

Let us select a fixed state of the Markov chain (M.C.), say state 0. We
then obtain a sequence of stopping times $\{T_i : i \ge 0\}$ such that $0 = T_0 <
T_1 < T_2 < \cdots$ and $X_{T_i} = 0$ almost surely (a.s.); that is, once the system
enters state 0, the simulation can proceed without any knowledge of its
past history.

For another example, let us consider the queue size at time $t$ for a
GI/G/1 queueing system. Suppose the time origin is taken to be an instant
of departure at which time the departing customer leaves behind exactly j 
customers. Then every time a departure occurs leaving behind j customers, 
the future of the stochastic process after such a time obeys exactly the 
same probability law as when the process started at time zero. More 
examples of regenerative processes are considered in Section 6.4. 

It is shown in Ref. 8 that under certain mild regularity conditions the
process $\{X(t) : t \ge 0\}$ has a limiting steady-state distribution in the sense
that there exists a random vector $X$ such that

\[ \lim_{t\to\infty} P\{X(t) \le x\} = P\{X \le x\}. \]


*A random variable $T$ taking values in $[0, +\infty)$ is a stopping time [5] for a stochastic process
$\{X(t) : t \ge 0\}$, provided that for every finite $t \ge 0$, the occurrence or nonoccurrence of the
event $\{T \le t\}$ can be determined from the history $\{X(s) : s \le t\}$ of the process up to time $t$.

This type of convergence is known as weak convergence and is denoted
$X(t) \Rightarrow X$ as $t \to \infty$. The random vector $X$ is called the steady-state vector.

Let $f : R^k \to R$ be a given real-valued measurable function, and suppose
we wish to estimate the value $r = E\{f(X)\}$, where $X$ is the steady-state
vector.

For the M.C. $\{X_n : n \ge 0\}$ we have

\[ r = E\{f(X)\} = \sum_{i \in I} f(i)\,P(X = i) = \sum_{i \in I} f(i)\,\pi_i. \tag{6.2.1} \]

Here, $\pi = \{P(X = i) : i \in I\}$ is the steady-state (stationary) distribution of
the regenerative process $\{X_n : n \ge 0\}$, and $f(i)$ can be interpreted as the
penalty (reward) paid in state $i$. To find $r$ we can solve the following linear
system of stationary equations, $\pi = \pi P$, where $P = \{P_{ij} : i, j \in I\}$ is the
transition matrix, and then apply (6.2.1).

Let us assume that the values $f(i)$ are known but the transition matrix is
unknown. It is clear that the value $r$ cannot be found analytically, since $\pi$
is determined by $P$, and simulation must be used. Another case is when $P$
is known but the state space is very large; in this case it may be quite
difficult to solve the system $\pi = \pi P$, and we must resort to simulation
again.

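Possible functions $f$ of interest are the following:

1 If

\[ f(i) = \begin{cases} 1, & i = j, \\ 0, & i \ne j, \end{cases} \]

then $E\{f(X)\} = \pi_j$.

2 If

\[ f(i) = \begin{cases} 1, & i \ge j, \\ 0, & i < j, \end{cases} \]

then $E\{f(X)\} = P\{X \ge j\}$.

3 If $f(i) = i^p$, $p > 0$, then $E\{f(X)\} = E\{X^p\}$.

4 If $f(i) = c_i$ = cost of being in state $i$, then $E\{f(X)\} = \sum_{i \in I} c_i P\{X = i\}$
(the stationary expected cost per unit time).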

Let $\tau_i$ denote the interval between the $i$th and the $(i + 1)$th regeneration
times, that is $\tau_i = T_{i+1} - T_i$, $i \ge 0$; $\tau_i$ is referred to as the length of the $i$th
cycle. Next, assume $E(\tau_i) < \infty$, and define

\[ Y_i = \int_{T_i}^{T_{i+1}} f(X(t))\,dt \tag{6.2.2} \]

or

\[ Y_i = \sum_{j=T_i}^{T_{i+1}-1} f(X_j), \tag{6.2.3} \]

depending on whether the process $\{X(t) : t \ge 0\}$ is continuous-time or
discrete-time. In other words, $Y_i$ is the penalty (reward) during the cycle of
length $\tau_i = T_{i+1} - T_i$. Naturally, $Y_i$ is a random variable (r.v.) because so
are $\tau_i$ and $f(X_j)$.

We now formulate two fundamental propositions that are used
extensively in the rest of this chapter.

Proposition 6.2.1. The sequence $\{(Y_i, \tau_i) : i \ge 1\}$ consists of independent
and identically distributed (i.i.d.) random vectors.

Proposition 6.2.2. If $\tau_1$ is aperiodic,* $E(\tau_1) < \infty$, and $E\{|f(X)|\} < \infty$,
then

\[ E\{f(X)\} = \frac{E(Y_1)}{E(\tau_1)}. \tag{6.2.4} \]

There is an analogous ratio formula when $\tau_1$ is periodic. For proof of these
propositions the reader is referred to [8].

Proposition 6.2.1 says that the behavior patterns of the system during
different cycles are statistically independent and identically distributed.
Proposition 6.2.2 enables us to estimate the value $r = E(Y_1)/E(\tau_1)$ (which
is the same as $r = E(Y_i)/E(\tau_i)$) by classical statistical methods, and to find
point estimators and confidence intervals for $r$. These two problems are the
subject of the next section.
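
To make the ratio formula (6.2.4) concrete, the following Python sketch cuts a simulated discrete-time Markov chain into cycles at each return to state 0 and estimates $E\{f(X)\}$ by $\sum Y_i / \sum \tau_i$. The two-state transition matrix is an assumption chosen for illustration (its stationary distribution is $\pi = (5/6, 1/6)$, which can be checked from $\pi = \pi P$), and the name `simulate_cycles` is hypothetical.

```python
import random

def simulate_cycles(P, f, n_cycles, seed=0):
    """Simulate a finite Markov chain started in state 0 and split it into
    regenerative cycles at each return to state 0.

    Returns the per-cycle rewards Y_i (as in (6.2.3)) and lengths tau_i.
    """
    rng = random.Random(seed)
    states = range(len(P))
    Y, tau = [], []
    for _ in range(n_cycles):
        state, reward, length = 0, 0.0, 0
        while True:
            reward += f(state)       # accumulate f(X_j) over the cycle
            length += 1
            state = rng.choices(states, weights=P[state])[0]
            if state == 0:           # regeneration: chain re-enters state 0
                break
        Y.append(reward)
        tau.append(length)
    return Y, tau

# Assumed example: estimate pi_1 with the indicator f(i) = 1 if i == 1.
P = [[0.9, 0.1],
     [0.5, 0.5]]
Y, tau = simulate_cycles(P, lambda i: 1.0 if i == 1 else 0.0, n_cycles=20_000)
r_hat = sum(Y) / sum(tau)
```

Since the cycles are i.i.d. by Proposition 6.2.1, the pairs in `Y` and `tau` are exactly the input needed for the point estimators and confidence intervals of Section 6.3.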


6.3 POINT ESTIMATORS AND CONFIDENCE INTERVALS [22, 28]

In this section we consider several point estimators and confidence
intervals for the ratio $E(Y_1)/E(\tau_1)$. The problem we consider is as follows:
given the i.i.d. sequence of random vectors $\{(Y_i, \tau_i) : i \ge 1\}$, find point
estimators and construct $100(1 - \delta)\%$ confidence intervals for the ratio
$E(Y_1)/E(\tau_1)$.


*The random variable $\tau_1$ is periodic with period $\lambda > 0$ if, with probability 1, it assumes values
in the set $\{0, \lambda, 2\lambda, \ldots\}$ and $\lambda$ is the largest such number. If there is no such $\lambda$, then $\tau_1$ is said
to be aperiodic.
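
Let $Z_i = Y_i - r\tau_i$. It is readily seen that the $Z_i$'s are i.i.d. r.v.'s, since the
vectors $(Y_i, \tau_i)$ also are. Note also that

\[ E(Z_i) = 0 \tag{6.3.1} \]

and

\[ \sigma^2 = \operatorname{var}(Z_i) = \operatorname{var}(Y_i) - 2r\operatorname{cov}(Y_i, \tau_i) + r^2\operatorname{var}(\tau_i). \tag{6.3.2} \]

Denote $\bar Y = (1/n)\sum_{i=1}^{n} Y_i$ and $\bar\tau = (1/n)\sum_{i=1}^{n} \tau_i$; then by virtue of the central
limit theorem (c.l.t.) we have

\[ \frac{n^{1/2}[\bar Y - r\bar\tau]}{\sigma} \Rightarrow N(0, 1) \quad \text{as } n \to \infty, \tag{6.3.3} \]

where $\Rightarrow$ denotes weak convergence and it is assumed that $\sigma^2 < \infty$. The
last formula can be rewritten as

\[ \frac{n^{1/2}[\hat r - r]}{\sigma/\bar\tau} \Rightarrow N(0, 1) \quad \text{as } n \to \infty, \tag{6.3.4} \]

where $\hat r = \bar Y/\bar\tau$. Inasmuch as $\sigma$ is unknown, we cannot obtain a confidence
interval for $r$ directly from (6.3.4). However, we can estimate $\sigma^2$ in (6.3.2)
from the sample, that is, by

\[ s^2 = s_{11} - 2\hat r s_{12} + \hat r^2 s_{22}, \tag{6.3.5} \]

where

\[ s_{11} = \frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar Y)^2, \qquad s_{22} = \frac{1}{n-1}\sum_{i=1}^{n} (\tau_i - \bar\tau)^2, \]

and

\[ s_{12} = \frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar Y)(\tau_i - \bar\tau). \]

It is straightforward to see that $s^2 \to \sigma^2$ a.s. as $n \to \infty$, so (6.3.4) can be
rewritten as

\[ \frac{n^{1/2}[\hat r - r]}{s/\bar\tau} \Rightarrow N(0, 1) \quad \text{as } n \to \infty, \tag{6.3.6} \]

and the $100(1 - \delta)\%$ confidence interval for $r = E(Y_1)/E(\tau_1)$ is

\[ I = \left[\hat r - \frac{z_\delta s}{\bar\tau n^{1/2}},\ \hat r + \frac{z_\delta s}{\bar\tau n^{1/2}}\right], \tag{6.3.7} \]

where $z_\delta = \Phi^{-1}(1 - \delta/2)$, $\Phi$ is the standard normal distribution function,
and $\hat r = \bar Y/\bar\tau$ is the point estimator of $E(Y_1)/E(\tau_1)$. The procedure for
obtaining a $100(1 - \delta)\%$ confidence interval for $r$ can be written as follows: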


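1 Simulate $n$ cycles of the regenerative process.

2 Compute the sequence $\tau_1, \ldots, \tau_n$ and the associated sequence
$Y_1, \ldots, Y_n$ (use (6.2.2) and (6.2.3), respectively, for a continuous-time or
discrete-time process).

3 Compute $\bar Y = (1/n)\sum_{i=1}^{n} Y_i$ and $\bar\tau = (1/n)\sum_{i=1}^{n} \tau_i$, and find the point
estimator by

\[ \hat r = \frac{\bar Y}{\bar\tau}. \tag{6.3.8} \]

4 Construct the confidence interval by

\[ I = \left[\hat r - \frac{z_\delta s}{\bar\tau n^{1/2}},\ \hat r + \frac{z_\delta s}{\bar\tau n^{1/2}}\right], \]

where $z_\delta = \Phi^{-1}(1 - \delta/2)$ and $\Phi$ is the standard normal distribution function.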
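
Steps 3 and 4 can be sketched as follows in Python. The sample variances and covariance are taken with divisor $n - 1$, which is an assumption about the garbled definitions following (6.3.5), and the function name `classical_ratio_ci` is hypothetical. The demonstration data are the five cycles $(Y_i, \tau_i)$ of the single-server queue example in Section 6.4.1; with only five cycles the interval is purely illustrative, since the c.l.t. argument is asymptotic.

```python
from math import sqrt
from statistics import NormalDist

def classical_ratio_ci(Y, tau, delta=0.10):
    """Classical point estimator (6.3.8) and 100(1 - delta)% interval (6.3.7)."""
    n = len(Y)
    y_bar = sum(Y) / n
    t_bar = sum(tau) / n
    r_hat = y_bar / t_bar
    # Sample variances and covariance (divisor n - 1 assumed).
    s11 = sum((y - y_bar) ** 2 for y in Y) / (n - 1)
    s22 = sum((t - t_bar) ** 2 for t in tau) / (n - 1)
    s12 = sum((y - y_bar) * (t - t_bar) for y, t in zip(Y, tau)) / (n - 1)
    # s^2 = s11 - 2 r s12 + r^2 s22, guarded against tiny negative rounding.
    s = sqrt(max(s11 - 2.0 * r_hat * s12 + r_hat ** 2 * s22, 0.0))
    z = NormalDist().inv_cdf(1.0 - delta / 2.0)   # z_delta = Phi^{-1}(1 - delta/2)
    half = z * s / (t_bar * sqrt(n))
    return r_hat, (r_hat - half, r_hat + half)

# Cycle data quoted in Section 6.4.1.
Y = [10, 0, 30, 50, 60]
tau = [2, 1, 3, 4, 5]
r_hat, (lo, hi) = classical_ratio_ci(Y, tau, delta=0.10)
```

For these data $\hat r = 150/15 = 10$ and $s^2 = 650 - 2 \cdot 10 \cdot 40 + 100 \cdot 2.5 = 100$, so the half-width is about $1.645 \cdot 10 / (3\sqrt{5}) \approx 2.45$.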

It is readily seen that $\hat r = \bar Y/\bar\tau$, referred to as the classical estimator [28],
is a biased but consistent estimator of $E(Y_1)/E(\tau_1)$. Iglehart [28] suggested,
for the same purpose, the following alternatives:


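BEALE ESTIMATOR

\[ \hat r_b(n) = \frac{\bar Y}{\bar\tau} \cdot \frac{1 + s_{12}/(n\bar Y\bar\tau)}{1 + s_{22}/(n\bar\tau^2)}, \tag{6.3.9} \]


FIELLER ESTIMATOR

\[ \hat r_f(n) = \frac{\bar Y\bar\tau - k_\delta s_{12}}{\bar\tau^2 - k_\delta s_{22}}, \tag{6.3.10} \]

where

\[ k_\delta = \frac{[\Phi^{-1}(1 - \delta/2)]^2}{n}, \]


JACKKNIFE ESTIMATOR

\[ \hat r_j(n) = \frac{1}{n}\sum_{i=1}^{n} \theta_i, \tag{6.3.11} \]

where

\[ \theta_i = n\,\frac{\bar Y}{\bar\tau} - (n - 1)\,\frac{\sum_{k \ne i} Y_k}{\sum_{k \ne i} \tau_k}, \]


TIN ESTIMATOR

\[ \hat r_t(n) = \frac{\bar Y}{\bar\tau}\left[1 + \frac{1}{n}\left(\frac{s_{12}}{\bar Y\bar\tau} - \frac{s_{22}}{\bar\tau^2}\right)\right]. \tag{6.3.12} \]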
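
The Beale, jackknife, and Tin point estimators can be computed side by side as in the Python sketch below. The formulas coded here are the commonly quoted forms of (6.3.9), (6.3.11), and (6.3.12), with $s_{12}$ and $s_{22}$ taken with divisor $n - 1$; both points are assumptions, since the printed formulas are partly illegible in this copy. The function name `ratio_estimators` is hypothetical, and the data are again the five cycles quoted in Section 6.4.1.

```python
def ratio_estimators(Y, tau):
    """Return the classical, Beale, Tin, and jackknife estimates of E(Y)/E(tau)."""
    n = len(Y)
    y_bar = sum(Y) / n
    t_bar = sum(tau) / n
    s22 = sum((t - t_bar) ** 2 for t in tau) / (n - 1)
    s12 = sum((y - y_bar) * (t - t_bar) for y, t in zip(Y, tau)) / (n - 1)

    classical = y_bar / t_bar
    beale = classical * (1 + s12 / (n * y_bar * t_bar)) / (1 + s22 / (n * t_bar ** 2))
    tin = classical * (1 + (s12 / (y_bar * t_bar) - s22 / t_bar ** 2) / n)

    # Jackknife pseudo-values: theta_i = n * r_hat - (n - 1) * (leave-one-out ratio).
    sum_y, sum_t = sum(Y), sum(tau)
    thetas = [n * classical - (n - 1) * (sum_y - y) / (sum_t - t)
              for y, t in zip(Y, tau)]
    jackknife = sum(thetas) / n

    return {"classical": classical, "beale": beale, "tin": tin,
            "jackknife": jackknife}

estimates = ratio_estimators([10, 0, 30, 50, 60], [2, 1, 3, 4, 5])
```

The jackknife needs every leave-one-out ratio, which is why it requires the whole sequence of $(Y_i, \tau_i)$ to be stored, whereas the Beale and Tin corrections need only running sums.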


Let us now cite some results from Ref. 28. The four point estimators
(6.3.9) through (6.3.12) as well as the classical estimator are biased. Their
expected value can be expressed as

\[ E[\hat r(n)] = r + \frac{c_1}{n} + \frac{c_2}{n^2} + O\!\left(\frac{1}{n^3}\right). \tag{6.3.13} \]

The point estimators (6.3.9), (6.3.11), and (6.3.12) have been suggested in
order to reduce the bias of (6.3.13) up to order $1/n^2$. For the jackknife
method $c_1 = 0$, since

\[ E[\hat r_j(n)] = n\left[r + \frac{c_1}{n} + \frac{c_2}{n^2} + O\!\left(\frac{1}{n^3}\right)\right] - (n - 1)\left[r + \frac{c_1}{n-1} + \frac{c_2}{(n-1)^2} + O\!\left(\frac{1}{(n-1)^3}\right)\right] = r + O\!\left(\frac{1}{n^2}\right). \]

The reader is asked to prove that for both Beale and Tin estimators $c_1$ is
also equal to zero.

Since both $n^{1/2}(\hat r - \hat r_b) \to 0$ and $n^{1/2}(\hat r - \hat r_t) \to 0$ a.s. as $n \to \infty$, we can
replace $\hat r$ by $\hat r_b$ or $\hat r_t$ both in (6.3.6) for the c.l.t. and (6.3.7) for the
confidence interval without changing the results.

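For the jackknife method formulas (6.3.6) and (6.3.7) can be written,
respectively, as

\[ \frac{n^{1/2}[\hat r_j(n) - r]}{s_j} \Rightarrow N(0, 1) \quad \text{as } n \to \infty \tag{6.3.14} \]

and

\[ I_j = \left[\hat r_j(n) - \frac{z_\delta s_j}{n^{1/2}},\ \hat r_j(n) + \frac{z_\delta s_j}{n^{1/2}}\right], \]

where

\[ s_j = \left[\sum_{i=1}^{n} \frac{(\theta_i - \hat r_j(n))^2}{n - 1}\right]^{1/2}. \]

The Fieller method yields the following $100(1 - \delta)\%$ confidence interval:

\[ I_f = \left[\frac{(\bar Y\bar\tau - k_\delta s_{12}) - D^{1/2}}{\bar\tau^2 - k_\delta s_{22}},\ \frac{(\bar Y\bar\tau - k_\delta s_{12}) + D^{1/2}}{\bar\tau^2 - k_\delta s_{22}}\right], \]

where

\[ D = (\bar Y\bar\tau - k_\delta s_{12})^2 - (\bar\tau^2 - k_\delta s_{22})(\bar Y^2 - k_\delta s_{11}) \]

and

\[ k_\delta = \frac{[\Phi^{-1}(1 - \delta/2)]^2}{n}. \]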

The performances of these estimators were compared numerically (via
simulating several stochastic models), and the following results were
obtained [28].

For short runs the jackknife method is recommended both for point
estimators and confidence intervals because it produces slightly better
statistical results than other methods. Two minor drawbacks of the
jackknife method are a large memory requirement and slightly more complex
programming. Additional storage addresses of the order of $2n$ are required,
where $n$ is the number of cycles observed. Where the storage requirement
for the jackknife method is excessive, the Beale or Tin methods are
recommended for point estimates and the classical method for the
confidence intervals. The Fieller method is recommended for neither point
estimates nor confidence intervals. It is found to be heavily biased for short
runs and more complicated than the classical method.

The above mentioned five point estimators were based on simulating $n$
cycles of regenerative processes. Another possibility is to consider point
estimators based on the simulation for a fixed (but large) length of time $t$.
In this case the number of cycles $N_t$ in the interval $(0, t]$ is a random
variable given by

\[ N_t = \sum_{i=1}^{\infty} I_{[0,t]}(T_i), \]

where $I_{[0,t]}$ is the indicator function of the interval $[0, t]$. Replacing $n$ by
$N_t$, we can modify all the point estimators (6.3.8) through (6.3.12),
preserving their consistency. For example, for the classical estimator we have

\[ \hat r(N_t) = \frac{\bar Y(N_t)}{\bar\tau(N_t)}. \]

Thus, asymptotically, there is little difference between point estimators
based on simulating $n$ regenerative cycles and those based on simulating for
a fixed length of time $t$. The c.l.t. in this case is

\[ \frac{t^{1/2}[\hat r(N_t) - r]}{\sigma/\{E(\tau_1)\}^{1/2}} \Rightarrow N(0, 1) \quad \text{as } t \to \infty. \tag{6.3.16} \]
(6.3.16) 

Recently Heidelberger and Meketon [22] considered estimators based on
simulations for a relatively short length of time $t$. They defined estimators

\[ \hat r(N_t) = \frac{\sum_{i=1}^{N_t} Y_i}{\sum_{i=1}^{N_t} \tau_i} \tag{6.3.17} \]

and

\[ \hat r(N_t + 1) = \frac{\sum_{i=1}^{N_t+1} Y_i}{\sum_{i=1}^{N_t+1} \tau_i}. \tag{6.3.18} \]

They then showed that

\[ E(\hat r(N_t)) = r + O\!\left(\frac{1}{t}\right) \tag{6.3.19} \]

and

\[ E(\hat r(N_t + 1)) = r + O\!\left(\frac{1}{t^2}\right), \tag{6.3.20} \]

so that a bias reduction is achieved by continuing the simulation until the
first regeneration after time $t$. The bias reduction is comparable to that of
the jackknife, Beale, and Tin estimators since $t$ is proportional to the
number of cycles. Table 6.4.3 lists empirical results from simulations of a
closed queueing network model for these estimators.

We turn now to the problem of determining run length. The $100(1 - \delta)\%$
confidence interval for a large but fixed number of cycles $n$ has a width
approximately equal to

\[ \frac{2\sigma\,\Phi^{-1}(1 - \delta/2)}{E(\tau_1)\,n^{1/2}}. \tag{6.3.21} \]

In terms of duration time $t$, (6.3.21) can be written as (see [24])

\[ \frac{2\sigma\,\Phi^{-1}(1 - \delta/2)}{\{E(\tau_1)\}^{1/2}\,t^{1/2}}. \]

Note that neither $\sigma$ nor $E(\tau_1)$ is known in advance. Hence it may be
worthwhile to take a small sample and obtain rough estimates for $\sigma$ and
$E(\tau_1)$. Such estimates would form a basis for a final decision on run length
and level of confidence. We wish to emphasize that all ratio estimators
described in this section are designed for simulations with a fixed number
of cycles $n$ or a fixed run length $t$. An alternative possibility would be to
consider procedures based on sequential stopping rules.
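
The pilot-sample idea can be sketched in Python as follows: given rough estimates of $\sigma$ and $E(\tau_1)$, solve the width formula for the number of cycles $n$. The width expression used is the one implied by the confidence interval (6.3.7), namely $2\sigma\Phi^{-1}(1 - \delta/2)/(E(\tau_1)\,n^{1/2})$; the function name `cycles_for_width` and the numerical inputs are assumptions made for illustration.

```python
from math import ceil
from statistics import NormalDist

def cycles_for_width(sigma, mean_tau, width, delta=0.10):
    """Smallest n for which 2 * sigma * z / (mean_tau * sqrt(n)) <= width.

    sigma and mean_tau are rough pilot estimates of sigma and E(tau_1).
    """
    z = NormalDist().inv_cdf(1.0 - delta / 2.0)
    return ceil((2.0 * sigma * z / (mean_tau * width)) ** 2)

# Assumed pilot values: s = 10 and mean cycle length 3, target width 1.
n_needed = cycles_for_width(sigma=10.0, mean_tau=3.0, width=1.0, delta=0.10)
```

Multiplying the resulting $n$ by the pilot estimate of $E(\tau_1)$ gives the corresponding rough run length $t$. Because the pilot estimates are themselves random, the result is a planning figure rather than a guarantee.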


6.4 EXAMPLES OF REGENERATIVE PROCESSES 

In this section we consider three examples of regenerative processes, taken 
from Refs. 6, 10, and 49: a single server queue, a repair model with spares, 
and a closed queueing network. 

6.4.1 A Single Server Queue GI/G/1 [6]

This example was described in Section 4.3.12, and will be briefly
recapitulated here.

Let $W_i$ and $S_i$ be the waiting time and service time, respectively, of the
$i$th customer in a single server queue. Let $A_{i+1}$ be the time between the
arrival of the $i$th and $(i + 1)$th customers. We assume that $\{S_i, i \ge 0\}$ are
i.i.d. with $E(S_i) = \mu^{-1}$ and that $\{A_i, i \ge 1\}$ are i.i.d. with $E(A_i) = \lambda^{-1}$. Let
the traffic intensity $\rho$ be defined by $\rho = \lambda/\mu$. We assume that customer
number 0 arrives at time 0 to an empty system. Let $X_i = S_{i-1} - A_i$ for $i \ge 1$.

The waiting time process $\{W_i, i \ge 0\}$ can be defined recursively by

\[ W_0 = 0, \qquad W_i = (W_{i-1} + X_i)^+, \quad i \ge 1. \]

It is known [36] that, if $\rho < 1$, there exists an infinite number of indices $i$
such that $W_i = 0$ and a random variable $W$ such that $W_i \Rightarrow W$ as $i \to \infty$.

Thus we choose zero state as our return state and regenerations occur
whenever a customer arrives to find an empty queue. We are interested in
estimating $E(W)$, which is finite if $E(S_n^2) < \infty$.
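
The recursion for $W_i$ and the splitting of the output into regenerative cycles can be sketched in Python as below. The function names `waiting_times` and `complete_cycles` and the small input lists are hypothetical; the list `A` is indexed so that `A[i - 1]` holds the text's $A_i$.

```python
def waiting_times(S, A):
    """Waiting times from W_0 = 0, W_i = max(W_{i-1} + S_{i-1} - A_i, 0).

    S[i] is the service time of customer i (i = 0, 1, ...); A[i - 1] is the
    time between the arrivals of customers i - 1 and i.
    """
    W = [0.0]
    for i in range(1, len(S)):
        W.append(max(W[-1] + S[i - 1] - A[i - 1], 0.0))
    return W

def complete_cycles(W):
    """Split waiting times into cycles that start whenever W_i = 0.

    Returns (Y, tau) for each completed cycle: the sum of the waiting times
    in the cycle and the number of customers served in it.
    """
    starts = [i for i, w in enumerate(W) if w == 0.0]
    return [(sum(W[a:b]), b - a) for a, b in zip(starts, starts[1:])]

# Small hand-checkable input (assumed values).
W = waiting_times(S=[2, 2, 1, 1], A=[1, 5, 1])
cycles = complete_cycles(W)
```

Here customer 1 waits one time unit, customers 2 and 3 find the server idle, so two cycles are completed and customer 3 opens a third, still incomplete, cycle that is not counted.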

Since no analytical results are available for calculating the steady-state
waiting time $E(W)$, we estimate it via simulation by making use of the
classical estimator (6.3.8). The simulation results are shown in Fig. 6.4.1.

We see that the customers 1, 3, 4, 7, 11, and 16 find the server idle, that
is, $W_1 = W_3 = W_4 = W_7 = W_{11} = W_{16} = 0$, while customers 2, 5, 6, 8, 9, 10,
12, 13, 14, and 15 find the server busy and wait in the queue before being
served.

It follows from Fig. 6.4.1 that the simulation data contains five complete 
cycles with the following pairs {(J', t,),i *» 1, . . . , 5) : (T,,^) * (10, 2), 
(Y 2 ,r 2 ) = ( 0,1), (T 3 ,t 3 ) = (30, 3), (T 4 , t 4 ) = (50, 4), and {Y 5 , r 5 ) - (60, 5)}. 
The sixth cycle will start with the arrival of customer 16. Using the 
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Customer number 

Fig. 6.4.1 Sample output of queueing simulation. 


classical estimator f = 1 ^/ 27 - |T,, we obtain 


s 


2 r, 

i-i 



/= i 


10 + 0 + 30 + SO + 60 


150 

15 


10 . 


This result can also be obtained by using the sample-mean estimator

$$\bar r = \frac{1}{N} \sum_{i=1}^{N} W_i = \frac{150}{15} = 10.$$

Here $N = \sum_{i=1}^{5} \tau_i = 15$ is the length of the run and $\sum_{i=1}^{5} Y_i = \sum_{i=1}^{N} W_i$.
A logical question arises. If both point estimators $\hat r$ and $\bar r$ are equal (we
assume that the length of the run $N$ is equal to $n$ complete cycles, $n < N$),
why do we need all the ratio estimators (6.3.8) through (6.3.12), (6.3.17),
and (6.3.18), based on the regenerative phenomena?
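
The equality of the two point estimates can be checked numerically. The sketch below splits a sequence of waiting times into cycles at each zero and computes both estimators. The individual waiting times are hypothetical: the text gives only the cycle pairs, so the values here were chosen to reproduce those pairs and are not read from Fig. 6.4.1. The helper name `regenerative_cycles` is likewise an assumption.

```python
def regenerative_cycles(waits):
    """Split waiting times into cycles that start at each customer with W = 0.

    Returns a list of (Y, tau): Y is the sum of waiting times in the cycle and
    tau the number of customers served in it. A trailing incomplete cycle is kept,
    so callers should pass data that ends exactly at a cycle boundary.
    """
    cycles = []
    for w in waits:
        if w == 0 or not cycles:
            cycles.append([0.0, 0])  # a new cycle begins with this customer
        cycles[-1][0] += w
        cycles[-1][1] += 1
    return [(y, tau) for y, tau in cycles]


# Hypothetical waiting times for customers 1..15, chosen to reproduce the
# cycle pairs (10,2), (0,1), (30,3), (50,4), (60,5) quoted in the text.
W = [0, 10, 0, 0, 10, 20, 0, 10, 20, 20, 0, 10, 20, 20, 10]
pairs = regenerative_cycles(W)
r_hat = sum(y for y, _ in pairs) / sum(tau for _, tau in pairs)  # ratio estimator
r_bar = sum(W) / len(W)                                          # sample mean
```

Both quantities equal 10 whenever the run ends exactly at a cycle boundary, because the numerators and denominators are then the same sums grouped differently.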

The answer can be found if we consider not only point estimators for
$r = E(W)$ but confidence intervals as well. In order to construct confidence
intervals in the sense of classical statistics, the simulation data must
form a sequence of i.i.d. samples from the same underlying probability
distribution. The simulation data from the queueing system is the sequence
of waiting times $W_1, \ldots, W_N$. Note, however, that if we start our simulation
with an empty queueing system, then the first few waiting times tend to be
short, that is, they are correlated, and as a rule, the sample-mean estimator
$\bar r$ will be a biased estimator of $r = E(W)$.
Table 6.4.1 Simulation Results for the M/M/1 Queue

Parameter                   Theoretical Value   Point Estimate   Confidence Interval
$E(W)$                      0.100               0.110            [0.096, 0.123]
$E(W^2)$                    0.040               0.046            [0.035, 0.056]
$E\{\sqrt{(W - 0.1)^+}\}$   0.120               0.133            [0.116, 0.148]
$E(\tau_1)$                 2.000               2.110            [2.012, 2.207]
$\sigma(W)$                 0.173               0.182            [0.141, 0.271]

Source: Ref. 6.

Note: Number of cycles $n = 2000$, level of confidence $100(1 - \delta) = 90\%$, number of
replications $N = 10$, $\lambda = 5$, $\mu = 10$.


To overcome this difficulty we can run the model until it reaches the
steady state and then start collecting and updating the simulation data.
Determining when the steady state has been reached is difficult and requires
considerable computation (CPU) time. Moreover, even in the steady state $W_i$
and $W_{i+1}$ remain correlated (if $W_i$ is short, then $W_{i+1}$
will also tend to be short and vice versa). Since the r.v.'s $W_i$ are
correlated, classical statistical methods cannot be applied in constructing
confidence intervals for $r = E(W)$. Still, this difficulty can be overcome by
using the regenerative property, namely, by grouping the simulation data
into independent pairs (blocks) $(Y_1, \tau_1), \ldots, (Y_n, \tau_n)$, which yields different
ratio estimators (see (6.3.8) through (6.3.12), (6.3.17), and (6.3.18)) and the
associated confidence intervals by means of classical statistics. Table 6.4.1
presents simulation results for the queueing system M/M/1 with $\lambda = 5$,
$\mu = 10$ based on a run of 2000 cycles. Confidence intervals at the 90% level
are given for the parameters $E(W)$, $E(W^2)$, $E\{\sqrt{(W - 0.1)^+}\}$, $E(\tau_1)$, and
$\sigma(W)$. The function $E\{\sqrt{(W - 0.1)^+}\}$ may be interpreted as a penalty for
long waiting time.
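
A minimal sketch of how such a confidence interval might be computed from the i.i.d. cycle pairs follows. It uses the standard large-sample interval for a ratio of means, with the variance of $Z_i = Y_i - \hat r \tau_i$ estimated from the cycles; this is assumed here to correspond to the classical estimator referred to in the text, whose exact form (6.3.8) is not reproduced in this section. The function names and the seed are hypothetical.

```python
import math
import random
from statistics import NormalDist


def simulate_mm1_cycles(lam, mu, n_cycles, rng):
    """Simulate an M/M/1 queue and return n_cycles pairs (Y_i, tau_i)."""
    pairs = []
    y, tau, w = 0.0, 0, 0.0          # first customer arrives to an empty system
    while len(pairs) < n_cycles:
        y += w
        tau += 1
        # Lindley recursion: waiting time of the next customer
        w = max(w + rng.expovariate(mu) - rng.expovariate(lam), 0.0)
        if w == 0.0:                  # next customer finds the system empty
            pairs.append((y, tau))
            y, tau = 0.0, 0
    return pairs


def ratio_confidence_interval(pairs, level=0.90):
    """Point estimate and CI for r = E(Y_1)/E(tau_1) from i.i.d. cycle pairs."""
    n = len(pairs)
    y_bar = sum(y for y, _ in pairs) / n
    t_bar = sum(t for _, t in pairs) / n
    r_hat = y_bar / t_bar
    # sample variance of Z_i = Y_i - r_hat * tau_i (its sample mean is zero)
    s2 = sum((y - r_hat * t) ** 2 for y, t in pairs) / (n - 1)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half = z * math.sqrt(s2 / n) / t_bar
    return r_hat, (r_hat - half, r_hat + half)


rng = random.Random(12345)            # arbitrary seed, for reproducibility only
pairs = simulate_mm1_cycles(lam=5.0, mu=10.0, n_cycles=2000, rng=rng)
r_hat, (lo, hi) = ratio_confidence_interval(pairs)
```

The interval is built from the pairs alone, so the correlation between successive waiting times inside a cycle never enters the variance estimate.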


6.4.2. A Repairman Model with Spares [10] 

We now consider a repairman problem with $n$ operating units and $m$
spares (Fig. 6.4.2). Each of the operating units fails with rate $\lambda$. A failed
unit enters a queue for service from one of $s$ repairmen on a first-in-first-out
(FIFO) basis and is replaced by a spare (if available). The distribution of
the i.i.d. repair times is exponential with mean $\mu^{-1}$ for each repairman. A
repaired unit enters the pool of spares unless there are fewer than $n$ units in
operation, in which case it immediately becomes operational. If $X(t)$ denotes
the number of units in service or waiting in the queue for service, then
$\{X(t), t \ge 0\}$ is a birth and death process with state space $I = \{0, 1, \ldots, m + n\}$,
birth rates

$$\lambda_i = \begin{cases} n\lambda, & 0 \le i \le m \\ (n + m - i)\lambda, & m < i \le n + m \end{cases}$$

and death rates

$$\mu_i = \begin{cases} i\mu, & 1 \le i \le s \\ s\mu, & s < i \le m + n. \end{cases}$$

Fig. 6.4.2 Repairman model with spares (queue, $s$ repairmen, $m$ spares, $n$ operating units).

Let us simulate the system for $T$ units of time and have as output the
values $X(t)$, $0 \le t \le T$, where $X(t)$ is the number of units at the repair
facility at time $t$. The sample mean $(1/T)\int_0^T X(t)\,dt$ is a consistent estimator
for $E(X)$, where $E(X)$ is the mean number of units at the repair facility
under steady-state conditions. However, unless the value $X(0)$ is obtained
by sampling from the steady-state distribution of $X$, the sample mean will
be a biased estimator due to the initial conditions. Moreover, it is seen
that, if $t_1$ is close to $t_2$, then $X(t_1)$ and $X(t_2)$ will be highly correlated,
because the number of units in the repair facility usually does not change
quickly.

Due to the initial bias of the estimator and to the correlation of the
output data, it is impossible to apply classical statistics in estimating the
steady-state value $r = E(X)$. However, by again applying the regenerative
approach the difficulty can be overcome. From here on we repeat in
essence what was done for the queueing simulation.

The process $\{X(t) : t \ge 0\}$ is a regenerative one in continuous time and

$$P\{X(t) = i\} \to P\{X = i\} \quad \text{as } t \to \infty \text{ for all } i \in I.$$
Table 6.4.2 Simulation Results for Repairman Model

Parameter          Theoretical Value   Confidence Interval
$E(X)$             5.353               [5.238, 5.432]
$E\{(X - 5)^+\}$   1.269               [1.201, 1.325]
$P\{X \ge 5\}$     0.465               [0.444, 0.475]
$P\{X > 0\}$       0.988               [0.987, 0.990]
$P\{X = 0\}$       0.012               [0.010, 0.013]
$E(\tau_1)$        42.021              [37.459, 47.681]
$E(T_1)$           73.375              [65.262, 83.342]

Source: Ref. 10.

Note: Run length = 500 cycles; level of confidence = 95%.


Suppose we start the simulation at time $T_1 = 0$ with $n$ operating units
and $m$ spares, that is, at $T_1 = 0$ the repair facility is empty; then the
sequence of regeneration times is $\{T_i : i > 0\}$, where $T_i$ is defined as the
$i$th time at which the repair facility becomes empty. In other words, the system
"starts afresh probabilistically," or regenerates itself, at each time $T_i$. For any
real-valued measurable function $f$ we define

$$Y_i = \int_{T_i}^{T_{i+1}} f(X(t))\,dt;$$

then the pairs $(Y_1, \tau_1), \ldots, (Y_n, \tau_n)$, where $\tau_i = T_{i+1} - T_i$, are i.i.d. Suppose
that the simulation time $T$ exactly equals $n$ cycles; then

$$\frac{1}{T} \int_0^T f(X(t))\,dt = \frac{\sum_{i=1}^{n} Y_i}{\sum_{i=1}^{n} \tau_i} = \frac{\frac{1}{n}\sum_{i=1}^{n} Y_i}{\frac{1}{n}\sum_{i=1}^{n} \tau_i}$$

is a biased but consistent estimator for $r = E(f(X)) = E(Y_1)/E(\tau_1)$.
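
The continuous-time ratio estimator can be illustrated with a small simulation of the birth and death process, taking $f(x) = x$ so that the ratio estimates $E(X)$. This is a sketch under stated assumptions: the function name `repairman_cycles`, the seed, and the rates `lam=0.2` and `mu=0.5` are illustrative choices, not values prescribed at this point in the text.

```python
import random


def repairman_cycles(n, m, s, lam, mu, n_cycles, rng):
    """Simulate the birth-death process X(t) and return pairs (Y_i, tau_i).

    A cycle starts each time the repair facility becomes empty; Y_i is the
    integral of X(t) over the cycle (f(x) = x) and tau_i is the cycle length.
    """
    def birth(i):                     # failure rate in state i
        return n * lam if i <= m else (n + m - i) * lam

    def death(i):                     # repair completion rate in state i
        return min(i, s) * mu

    pairs = []
    x, y, tau = 0, 0.0, 0.0           # T_1 = 0: the facility starts empty
    while len(pairs) < n_cycles:
        up, down = birth(x), death(x)
        hold = rng.expovariate(up + down)   # exponential holding time in state x
        y += x * hold
        tau += hold
        x += 1 if rng.random() * (up + down) < up else -1
        if x == 0:                    # regeneration: facility is empty again
            pairs.append((y, tau))
            y, tau = 0.0, 0.0
    return pairs


rng = random.Random(2024)             # arbitrary seed
pairs = repairman_cycles(n=10, m=5, s=4, lam=0.2, mu=0.5, n_cycles=500, rng=rng)
r_hat = sum(y for y, _ in pairs) / sum(t for _, t in pairs)   # estimates E(X)
```

Because $X(t)$ is piecewise constant, the integral over a cycle reduces to a sum of state values weighted by holding times.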

Table 6.4.2 gives simulation results for some output parameters based on
a run of 500 cycles. $E(T_1)$ represents the expected number of failures over a cycle. It is
assumed that $n = 10$, $m = 5$, $s = 4$, and $\mu^{-1} = 2$. The "lifetime" of an operating
unit is exponentially distributed with mean $\lambda^{-1} = 5$.
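
Since the model is a finite birth and death process, its steady-state distribution can also be computed directly from the balance equations, which gives a check on the theoretical column of Table 6.4.2. The sketch below assumes that the numbers 5 and 2 are the mean lifetime and the mean repair time, that is, rates 1/5 and 1/2; under that reading the computed values agree with the tabulated ones to the printed precision. The function name is hypothetical.

```python
def repairman_stationary(n, m, s, lam, mu):
    """Stationary distribution of the birth-death process with
    birth rates n*lam (i <= m), (n+m-i)*lam (i > m) and death rates min(i, s)*mu."""
    weights = [1.0]
    for i in range(n + m):
        birth = n * lam if i <= m else (n + m - i) * lam
        death = min(i + 1, s) * mu
        weights.append(weights[-1] * birth / death)   # detailed balance
    total = sum(weights)
    return [w / total for w in weights]


# Assumption: mean lifetime 5 and mean repair time 2, i.e. rates 1/5 and 1/2.
pi = repairman_stationary(n=10, m=5, s=4, lam=1 / 5, mu=1 / 2)
mean_x = sum(i * p for i, p in enumerate(pi))
excess = sum((i - 5) * p for i, p in enumerate(pi) if i > 5)
p_empty = pi[0]
cycle_length = 1.0 / (pi[0] * 10 * (1 / 5))   # 1 / (P{X=0} * rate out of state 0)
```

The last line uses the renewal argument that the facility empties once per cycle, so the long-run rate of leaving state 0 is the reciprocal of the mean cycle length.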


6.4.3 A Closed Queueing Network [49]

Consider a closed queueing system that is a model of the time-sharing
computer system in Fig. 6.4.3. The network comprises $M$ service centers
with a fixed number $N$ of customers. Service center 1 consists of $N$
terminals (identical parallel servers); hence a customer at this center never
has to wait for a server to become free. Service center 2 is a single server
processor operating under processor sharing, that is, all customers receive
service immediately, and if there are $k$ customers present each customer is
served at $1/k$ of the server's rate.
Service centers $3, \ldots, M$ represent peripheral input-output devices (single
server queues), each of which is scheduled on a FIFO basis. A customer
completing service at service center 1 immediately enters service
center 2, and immediately thereafter service center $j$ with probability
$p_j > 0$, $j = 3, \ldots, M$, where $\sum_{j=3}^{M} p_j = 1$. After completing service at service
center $j$, $j = 3, \ldots, M$, the customer enters service center 1 with probability
$1 - p$, or service center 2 with probability $p$. Service times at service centers
$j = 1, 2, \ldots, M$ are i.i.d. and exponentially distributed with mean $\mu_j^{-1}$. It is
assumed that routing through the network is Markovian and that all
service and routing mechanisms are mutually independent.

Fig. 6.4.3

Let $Q(t) = (Q_1(t), \ldots, Q_M(t))$, where $Q_j(t)$ is the number of customers
at service center $j$ at time $t$. It can then be shown [49] that $\{Q(t) : t \ge 0\}$ is
a continuous-time irreducible Markov chain, and hence a regenerative
process. We define a response time as the time interval between a customer's
departure from service center 1 and his next return to it, and let $W_i$ be the
just completed response time of the $i$th customer arriving there. Then $W = \{W_i, i \ge 0\}$
is regenerative, with regeneration occurring whenever a customer
arrives at service center 1 and leaves centers $2, \ldots, M$ empty. Again, we are
interested in the expected stationary response time $r = E(W)$, which is
known to be finite [49]. Let $\rho_j$ be the utilization of service center $j$, that is,
the long run average proportion of time the server there is busy. The
particular parameters chosen for this model are listed in Fig. 6.4.3 and
yield $\rho_2 = 0.894$, $\rho_i = 0.268$, $r = 8.65$.

Table 6.4.3 presents point estimators and 90% confidence intervals for
several ratio estimators discussed in Section 6.3.
Table 6.4.3 Point Estimates and 90% Confidence Intervals for $E(W) = 8.65$ in
Closed Queueing Network

Estimate             N = 5         N = 10        N = 30        N = 50
                     t = 220       t = 440       t = 1320      t = 2200
$\hat r(N_t)$        8.28 ± 0.10   8.46 ± 0.07   8.55 ± 0.07   8.59 ± 0.07
$\hat r(N_t + 1)$    8.64 ± 0.10   8.60 ± 0.07   8.62 ± 0.07   8.63 ± 0.07
Classical            8.23 ± 0.17   8.50 ± 0.09   8.56 ± 0.07   8.60 ± 0.08
Jackknife            8.93 ± 0.23   8.71 ± 0.09   8.61 ± 0.07   8.62 ± 0.08

Source: Ref. 22.

Note: $N$ = number of cycles simulated; $t$ = number of response times simulated;
$R = 200$ replications for $t = 220, 440$; $R = 100$ replications for $t = 1320$; $R = 60$
replications for $t = 2200$.


6.5 SELECTING THE BEST STABLE STOCHASTIC SYSTEM

In this section we consider some techniques for selecting the best system
from among $N$ alternative systems according to a given criterion.

Assume that $N$ ($N \ge 2$) stochastic systems are being simulated, each
giving rise to a regenerative process $\{X^i(t) : t \ge 0\}$, $i = 1, \ldots, N$. For example,
$N$ alternative designs are considered for a new system.

Suppose that the measure of performance for the $i$th system is

$$r_i = E\{f(X^i)\}, \quad i = 1, \ldots, N, \tag{6.5.1}$$

where $f$ is a real-valued bounded measurable function and $X^i$ is the steady-state
random variable of the regenerative process $\{X^i(t) : t \ge 0\}$. The problem is
to choose the best system, that is, the system with the smallest value of $r_i$:

$$r_i = \min_{l = 1, \ldots, N} r_l = \min_{l = 1, \ldots, N} E\{f(X^l)\}. \tag{6.5.2}$$

(We are minimizing $r_i$; the alternative problem of maximizing $r_i$ can be
considered as well.)

Iglehart [30] presents a method based on the following scheme. Two
positive numbers $P^*$ and $\delta^*$ are specified. Then with probability $P^*$ the
system with the smallest (largest) $r_i$ is selected whenever that value of $r_i$ is
separated by at least $\delta^*$ from the other $r_i$'s. Two procedures have been
considered in Ref. 30 for this problem. The first procedure is sequential
and the second is two-stage. Both procedures involve the use of normal
approximations and require large samples in terms of the number of cycles
of the regenerative processes simulated.

We consider here another adaptive approach suggested by Rubinstein 
[61]. Our method is based on an iterative procedure that selects the best 
system with probability 1. 

We start solving the problem (6.5.2) by considering the following linear
programming problem:

$$\min_p W(p) = \min_p \sum_{i=1}^{N} E\{f(X^i)\}\, p_i \tag{6.5.3}$$

subject to

$$\sum_{i=1}^{N} p_i = 1, \quad p_i \ge 0, \quad i = 1, \ldots, N. \tag{6.5.4}$$

If there exists a unique solution of (6.5.2), then the problem (6.5.3)-(6.5.4)
is equivalent to (6.5.2) and its solution is given by a vector $p^*$ with a single
nonzero component, located in the $i$th position:

$$p^* = \{0, \ldots, 0, 1, 0, \ldots, 0\}. \tag{6.5.5}$$


The algorithm for solving the problem (6.5.3)-(6.5.4) is based on a
step-by-step correction of the probability vector $p[n]$, where $n$ denotes the
step number. There exists a mechanism, provided by (6.5.9) below, which
ensures that $p_i[n] \ge \varepsilon[n]$, $i = 1, \ldots, N$, where $\{\varepsilon[n]\}_{n=0}^{\infty}$ is a monotone
decreasing sequence of positive numbers, subject to (6.5.13) through (6.5.16)
below. On the $n$th step the $i$th system, $i \in \{1, \ldots, N\}$, is chosen by
simulating the distribution $p[n-1]$. We denote this event by $X[n] = X^i$.
One cycle of the process $\{X^i(t) : t \ge 0\}$ is carried out. Denote by $\nu_i[n]$,
$i = 1, \ldots, N$, the total number of renewal cycles made by the $i$th system up
to and including the $n$th step. We check whether or not the inequality
$\nu_k[n-1] \ge n\varepsilon[n]$, $k \in \{1, \ldots, i-1, i+1, \ldots, N\}$, is satisfied for all systems.
If for some indices $k_1, \ldots, k_s \in \{1, \ldots, i-1, i+1, \ldots, N\}$ this inequality
does not hold, then one additional cycle is carried out for each system
$k_1, \ldots, k_s$, so that ultimately

$$\nu_k[n] \ge n\varepsilon[n], \quad k = 1, \ldots, N. \tag{6.5.6}$$

We record

$$\tau_n^k = T^k_{\nu_k[n]} - T^k_{\nu_k[n]-1}, \quad k = i, k_1, \ldots, k_s,$$

the lengths of the cycles performed, and for each $k$ calculate

$$Y_n^k = \int_{T^k_{\nu_k[n]-1}}^{T^k_{\nu_k[n]}} f(X^k(t))\,dt, \quad k = i, k_1, \ldots, k_s, \tag{6.5.7}$$

if the process $\{X^k(t) : t \ge 0\}$ is continuous-time.
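In the case of a discrete-time process the integral should be replaced by
the corresponding sum over the $\nu_k[n]$th cycle. Set also

$$Y_n^k = \tau_n^k = 0, \quad \text{if } k \ne i, k_1, \ldots, k_s. \tag{6.5.8}$$

We construct a new distribution $p[n]$ by the following recurrence formula:

$$p[n] = \pi_{\varepsilon[n]}\{p[n-1] - \gamma[n] B(n \mid i)\}. \tag{6.5.9}$$

Here $S_\varepsilon$ is a simplex in $R^N$:

$$S_\varepsilon = \Big\{ p = (p_1, \ldots, p_N) : \sum_{k=1}^{N} p_k = 1,\ 0 \le \varepsilon \le p_k \le 1 \Big\},$$

$\pi_\varepsilon$ is the projection operator onto the simplex $S_\varepsilon$, such that, for any
$x \in R^N$,

$$\|x - \pi_\varepsilon(x)\| = \min_{y \in S_\varepsilon} \|x - y\|,$$

and $B(\cdot \mid \cdot)$ is a vector $\{B_1(\cdot \mid \cdot), \ldots, B_N(\cdot \mid \cdot)\}$, where

$$B_k(n \mid i) = \delta_{ik}\, p_k^{-1}[n-1]\, r_k[n], \tag{6.5.10}$$

$$r_k[n] = \frac{Y_k[n]}{\tau_k[n]}, \tag{6.5.11}$$

$$Y_k[n] = Y_k[n-1] + Y_n^k, \quad \tau_k[n] = \tau_k[n-1] + \tau_n^k, \quad k = 1, \ldots, N, \tag{6.5.12}$$

$$\delta_{ik} = \begin{cases} 1, & i = k \\ 0, & i \ne k. \end{cases}$$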

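The initial values of $p[0] \in S_{\varepsilon[0]}$, $Y[0] = (Y_1[0], \ldots, Y_N[0])$, and $\tau[0] =
(\tau_1[0], \ldots, \tau_N[0])$ can be chosen arbitrarily, for example, $Y_k[0] = 0$, $\tau_k[0] = 0$,
$k = 1, \ldots, N$. The sequences $\{\gamma[n]\}_{n=1}^{\infty}$ and $\{\varepsilon[n]\}_{n=0}^{\infty}$ must be chosen so
that the following conditions are satisfied:

$$\gamma[n] \downarrow 0, \quad \varepsilon[n] \downarrow 0 \tag{6.5.13}$$

$$\sum_{n=1}^{\infty} \gamma[n] = \infty \tag{6.5.14}$$

$$\sum_{n=1}^{\infty} \gamma^2[n] < \infty \tag{6.5.15}$$

$$\sum_{n=1}^{\infty} \frac{\gamma[n]}{\varepsilon[n]\sqrt{n}} < \infty. \tag{6.5.16}$$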




Remark 1 In order to satisfy conditions (6.5.13) through (6.5.16) take, for
example, 

y[n]~~ e[n]~~n~ 0A . 

Remark 2 We assume that $\tau_n^k \ge \tau_0 > 0$, $k = 1, \ldots, N$, $n = 1, 2, \ldots$, that is,
that the cycle is taken into account only if it is of some minimal length
(which can be considered as the sensitivity threshold of the measuring
instrument).

Remark 3 The r.v.'s $Y_k[n]$ and $\tau_k[n]$, $k = 1, \ldots, N$, $n \ge 1$, defined in
(6.5.12), store the information obtained up to and including the $n$th step.
We should also note that, for each $k$ fixed, only $\nu_k[n]$ summands in both
$Y_k[n]$ and $\tau_k[n]$ are nonzero.

Theorem 6.5.1 If the values of the function $f$ are uniformly bounded by
some constant $D$ and if there exists the unique optimal solution $p^*$ of the
problem (6.5.3)-(6.5.4), then for any initial distribution $p[0] \in S_{\varepsilon[0]}$ the
sequence $\{p[n]\}_{n=1}^{\infty}$, generated by the algorithm (6.5.9)-(6.5.16), converges
to $p^*$ with probability 1.
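
The whole procedure can be sketched as follows. This is an illustration under several assumptions, not the author's implementation: the step sizes $\gamma[n] = 1/n$ and the lower bounds $\varepsilon[n] = (n+1)^{-0.4}/(2N)$ are one choice made here so that $N\varepsilon[n] < 1$, the projection is the standard sort-based one, and the three test systems produced by `make_system` are made up.

```python
import random


def project(x, eps):
    """Euclidean projection onto {p : sum(p) = 1, p_k >= eps} (sort-based rule)."""
    n = len(x)
    mass = 1.0 - n * eps
    z = [v - eps for v in x]
    cumulative, theta = 0.0, 0.0
    for j, uj in enumerate(sorted(z, reverse=True), start=1):
        cumulative += uj
        t = (cumulative - mass) / j
        if uj - t > 0:
            theta = t
    return [max(v - theta, 0.0) + eps for v in z]


def select_best(cycle_fns, n_steps, rng):
    """Adaptive search over N systems; cycle_fns[k](rng) returns one pair (Y, tau)."""
    N = len(cycle_fns)
    gamma = lambda n: 1.0 / n                         # assumed step sizes
    eps = lambda n: (n + 1) ** -0.4 / (2 * N)         # assumed lower bounds, N*eps < 1
    p = [1.0 / N] * N
    Y, tau, nu = [0.0] * N, [0.0] * N, [0] * N

    def run_cycle(k):
        y, t = cycle_fns[k](rng)
        Y[k] += y
        tau[k] += t
        nu[k] += 1

    for n in range(1, n_steps + 1):
        i = rng.choices(range(N), weights=p)[0]       # draw system i from p[n-1]
        run_cycle(i)
        for k in range(N):                            # enforce nu_k[n] >= n * eps[n]
            if k != i and nu[k] < n * eps(n):
                run_cycle(k)
        r_i = Y[i] / tau[i]                           # current ratio estimate of r_i
        step = list(p)
        step[i] -= gamma(n) * r_i / p[i]              # only component i of B is nonzero
        p = project(step, eps(n))
    return p


def make_system(mean):
    # hypothetical system: cycle length tau >= 1, cycle reward Y with E(Y)/E(tau) = mean
    def cycle(rng):
        t = 1.0 + rng.random()
        return mean * t + rng.uniform(-0.5, 0.5), t
    return cycle


rng = random.Random(7)                                # arbitrary seed
p = select_best([make_system(1.0), make_system(2.0), make_system(3.0)], 500, rng)
```

Dividing by $p_i[n-1]$ makes the expected correction in coordinate $k$ equal to $\gamma[n] r_k$, so on average the step is a gradient step on $W(p)$. The theorem guarantees convergence only in the limit, so a finite run such as this one should be read as a demonstration of the mechanics rather than of the final selection.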

Corollary The theorem remains valid if we assume that the values of the
function $f$ cannot be observed directly, but are measured with a random
noise. In other words,

$$f(X^i) = E_\xi\{Q(X^i, \xi)\}, \quad i = 1, \ldots, N,$$

where $\xi$ is a random vector with an unknown time-independent probability
distribution function. In this case we can consider another random process:

$$\{U^i(t) : t \ge 0\} = \{(X^i(t), \xi)\}, \quad i = 1, \ldots, N.$$

If $\{X^i(t) : t \ge 0\}$ is regenerative, then $\{U^i(t) : t \ge 0\}$ is also regenerative
and the values of $Q$ are uniquely defined for each value of the steady-state
r.v. $U^i$ of the process $\{U^i(t) : t \ge 0\}$, and

$$E\{f(X^i)\} = E\{E_\xi\{Q(X^i, \xi)\}\} = E\{Q(U^i)\}, \quad i = 1, \ldots, N.$$

Proof of the Theorem Before proving the theorem, let us introduce some
notation. Let

$$t_i[n] = r_i[n] - r_i, \qquad t[n] = \max_i |t_i[n]|, \tag{6.5.17}$$

where $r_i = E\{f(X^i)\}$, $i = 1, \ldots, N$, and $n = 1, 2, \ldots$.
On the $n$th step the state of the algorithm can be described by a
$4N$-dimensional vector $Z[n] = (p[n], \tau[n] = (\tau_1[n], \ldots, \tau_N[n]), Y[n] =
(Y_1[n], \ldots, Y_N[n]), \nu[n] = (\nu_1[n], \ldots, \nu_N[n]))$. We first prove the following
lemma.

Lemma. For any $Z[0]$ such that $p[0] \in S_{\varepsilon[0]}$ and $\tau[0] > 0$,

$$\sum_{n=1}^{\infty} \gamma[n] E\{|t_i[n]| \mid Z[0]\} < \infty, \quad i = 1, \ldots, N. \tag{6.5.18}$$
Proof Without loss of generality set $i = 1$ and define

$$Z_n^1 = Y_n^1 - r_1 \tau_n^1, \quad n = 1, 2, \ldots.$$

If a cycle of the regenerative process $\{X^1(t) : t \ge 0\}$ was not carried out on
the $n$th step, then $Z_n^1 = 0$. For all $n$'s such that a cycle of $\{X^1(t) : t \ge 0\}$
was performed on the $n$th step, the $Z_n^1$ are i.i.d. r.v.'s with $E(Z_n^1) = 0$ and
variance $\sigma_1^2$. Define also

$$Z^1[n] = Z^1[n-1] + Z_n^1, \quad n = 1, 2, \ldots, \qquad Z^1[0] = Y_1[0] - r_1 \tau_1[0].$$

Then by the Cauchy-Schwarz inequality,

$$\begin{aligned}
E\{|t_1[n]| \mid Z[0]\} &= E\Big\{\Big|\frac{Z^1[n]}{\tau_1[n]}\Big| \,\Big|\, Z[0]\Big\} \\
&\le E^{1/2}\{(Z^1[n])^2 \mid Z[0]\} \cdot E^{1/2}\{(\tau_1[n])^{-2} \mid Z[0]\} \\
&\le ((Z^1[0])^2 + n\sigma_1^2)^{1/2} \cdot E^{1/2}\{(\tau_1[0] + \tau_0 \nu_1[n])^{-2} \mid Z[0]\},
\end{aligned}$$

where $\tau_0$ was defined in Remark 2. Since by (6.5.6) $\nu_1[n] \ge n\varepsilon[n]$, we have

$$E\{|t_1[n]| \mid Z[0]\} \le ((Z^1[0])^2 + n\sigma_1^2)^{1/2} (\tau_1[0] + \tau_0 n \varepsilon[n])^{-1}.$$

Thus for $n$ large enough

$$E\{|t_1[n]| \mid Z[0]\} \le A_1 \varepsilon^{-1}[n]\, n^{-1/2}, \tag{6.5.19}$$

where $A_1 = A_1(Z[0])$. Inequality (6.5.19) and condition (6.5.16) imply the
convergence of the series (6.5.18). Q.E.D.

Corollary For any state $Z[n]$ of the algorithm on the $n$th step,

$$\sum_{m=n+1}^{\infty} \gamma[m] E\{t[m] \mid Z[n]\} < \infty.$$

Now we can prove our theorem. 
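Consider the vector $p^*[n] \in S_{\varepsilon[n]}$ such that

$$p_l^*[n] = \begin{cases} \varepsilon[n], & l \ne i \\ 1 - (N-1)\varepsilon[n], & l = i, \end{cases} \tag{6.5.20}$$

where $i$ is defined by (6.5.2) and is unique by the condition of Theorem
6.5.1. Writing $r(p) = \sum_{l=1}^{N} r_l p_l$ for the objective $W(p)$ of (6.5.3), we have

$$r(p^*[n]) = \min_{p \in S_{\varepsilon[n]}} r(p). \tag{6.5.21}$$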


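By the algorithm (6.5.9)-(6.5.11), the properties of $\pi_{\varepsilon[n]}$, and (6.5.20), we
have

$$\begin{aligned}
\|p[n] - p^*[n]\|^2 \le{} & \|p[n-1] - p^*[n-1]\|^2 + \|p^*[n-1] - p^*[n]\|^2 \\
& + 2\langle p[n-1] - p^*[n-1],\ p^*[n-1] - p^*[n] \rangle \\
& + \gamma^2[n] (r_i[n])^2 (p_i[n-1])^{-2} \\
& - 2\gamma[n] (p_i[n-1] - p_i^*[n])\, r_i[n]\, (p_i[n-1])^{-1} \\
\le{} & \|p[n-1] - p^*[n-1]\|^2 + 3N(\varepsilon[n-1] - \varepsilon[n]) \\
& - 2\gamma[n] (p_i[n-1] - p_i^*[n])\, r_i[n]\, (p_i[n-1])^{-1} \\
& + D^2 \gamma^2[n] (p_i[n-1])^{-2},
\end{aligned} \tag{6.5.22}$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product. Taking the conditional expectation
of both sides of (6.5.22) with $Z[n-1]$ fixed, we obtain

$$\begin{aligned}
E\{\|p[n] - p^*[n]\|^2 \mid Z[n-1]\}
\le{} & \|p[n-1] - p^*[n-1]\|^2 + 3N(\varepsilon[n-1] - \varepsilon[n]) + D^2 N \gamma^2[n] \varepsilon^{-1}[n-1] \\
& - 2\gamma[n] \sum_{i=1}^{N} (p_i[n-1] - p_i^*[n])\, E\{r_i[n] \mid Z[n-1]\} \\
={} & \|p[n-1] - p^*[n-1]\|^2 + 3N(\varepsilon[n-1] - \varepsilon[n]) + D^2 N \gamma^2[n] \varepsilon^{-1}[n-1] \\
& - 2\gamma[n] \sum_{i=1}^{N} (p_i[n-1] - p_i^*[n-1])\, E\{r_i[n] \mid Z[n-1]\} \\
& + 2\gamma[n] \sum_{i=1}^{N} (p_i^*[n] - p_i^*[n-1])\, E\{r_i[n] \mid Z[n-1]\} \\
\le{} & \|p[n-1] - p^*[n-1]\|^2 + L(\varepsilon[n-1] - \varepsilon[n]) + D^2 N \gamma^2[n] \varepsilon^{-1}[n-1] \\
& - 2\gamma[n] \sum_{i=1}^{N} (p_i[n-1] - p_i^*[n-1])\, E\{r_i[n] \mid Z[n-1]\},
\end{aligned} \tag{6.5.23}$$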
where $L = 3N + 2ND \max_n \gamma[n]$. From (6.5.23) it follows that

$$\begin{aligned}
E\{\|p[n] - p^*[n]\|^2 \mid Z[n-1]\}
\le{} & \|p[n-1] - p^*[n-1]\|^2 + L(\varepsilon[n-1] - \varepsilon[n]) + D^2 N \gamma^2[n] \varepsilon^{-1}[n-1] \\
& - 2\gamma[n] \sum_{i=1}^{N} (p_i[n-1] - p_i^*[n-1])\, r_i \\
& - 2\gamma[n] \sum_{i=1}^{N} (p_i[n-1] - p_i^*[n-1])\, E\{t_i[n] \mid Z[n-1]\} \\
\le{} & \|p[n-1] - p^*[n-1]\|^2 + L(\varepsilon[n-1] - \varepsilon[n]) + D^2 N \gamma^2[n] \varepsilon^{-1}[n-1] \\
& - 2\gamma[n] \sum_{i=1}^{N} (p_i[n-1] - p_i^*[n-1])\, E\{t_i[n] \mid Z[n-1]\},
\end{aligned} \tag{6.5.24}$$

since by (6.5.21)

$$\sum_{i=1}^{N} (p_i[n-1] - p_i^*[n-1])\, r_i = r(p[n-1]) - r(p^*[n-1]) \ge 0.$$

Therefore

$$\begin{aligned}
E\{\|p[n] - p^*[n]\|^2 \mid Z[n-1]\}
\le{} & \|p[n-1] - p^*[n-1]\|^2 + L(\varepsilon[n-1] - \varepsilon[n]) \\
& + D^2 N \gamma^2[n] \varepsilon^{-1}[n-1] + 2N\gamma[n] E\{t[n] \mid Z[n-1]\}.
\end{aligned} \tag{6.5.25}$$

Denote

$$\begin{aligned}
v[n] ={} & \|p[n] - p^*[n]\|^2 + L\varepsilon[n] + D^2 N \sum_{m=n+1}^{\infty} \gamma^2[m] \varepsilon^{-1}[m-1] \\
& + 2N \sum_{m=n+1}^{\infty} \gamma[m] E\{t[m] \mid Z[n]\}.
\end{aligned} \tag{6.5.26}$$

The first sum in (6.5.26) exists by (6.5.15) and the second by the
corollary from the lemma. Taking the conditional expectation of both sides
in (6.5.26), we obtain

$$\begin{aligned}
E\{v[n] \mid Z[n-1]\} ={} & E\{\|p[n] - p^*[n]\|^2 \mid Z[n-1]\} + L\varepsilon[n] + D^2 N \sum_{m=n+1}^{\infty} \gamma^2[m] \varepsilon^{-1}[m-1] \\
& + 2N \sum_{m=n+1}^{\infty} \gamma[m] E\{E\{t[m] \mid Z[n]\} \mid Z[n-1]\} \\
={} & E\{\|p[n] - p^*[n]\|^2 \mid Z[n-1]\} + L\varepsilon[n] + D^2 N \sum_{m=n+1}^{\infty} \gamma^2[m] \varepsilon^{-1}[m-1] \\
& + 2N \sum_{m=n+1}^{\infty} \gamma[m] E\{t[m] \mid Z[n-1]\}.
\end{aligned} \tag{6.5.27}$$

The last equality in (6.5.27) is justified by the fact that $Z[n]$ is a Markov
chain taking values in $R^{4N}$ [5]. Using (6.5.25),

$$\begin{aligned}
E\{v[n] \mid Z[n-1]\} \le{} & \|p[n-1] - p^*[n-1]\|^2 + L\varepsilon[n-1] + D^2 N \sum_{m=n}^{\infty} \gamma^2[m] \varepsilon^{-1}[m-1] \\
& + 2N \sum_{m=n}^{\infty} \gamma[m] E\{t[m] \mid Z[n-1]\} = v[n-1].
\end{aligned} \tag{6.5.28}$$

Thus $v[n]$ is a supermartingale [5] with respect to $Z[n]$, and $v[n] \to v$ a.s.
as $n \to \infty$.

On the other hand,

$$u[n] = 2N \sum_{m=n+1}^{\infty} \gamma[m] E\{t[m] \mid Z[n]\} \tag{6.5.29}$$

is also a supermartingale, since

$$E\{u[n] \mid Z[n-1]\} = 2N \sum_{m=n+1}^{\infty} \gamma[m] E\{t[m] \mid Z[n-1]\} \le u[n-1]. \tag{6.5.30}$$

Therefore $u[n] \to u$ a.s. as $n \to \infty$ and thus $\|p[n] - p^*[n]\|^2 \to v - u$ a.s.

Taking the unconditional (i.e., conditioned by $Z[0]$) expectation of both
sides of the first inequality in (6.5.24), using (6.5.25), and summing up from
$n = 1$ to $n = n_1$, we obtain

$$\begin{aligned}
\sum_{n=1}^{n_1} E\{\|p[n] - p^*[n]\|^2\} \le{} & \sum_{n=1}^{n_1} E\{\|p[n-1] - p^*[n-1]\|^2\} + L(\varepsilon[0] - \varepsilon[n_1]) \\
& + D^2 N \sum_{n=1}^{n_1} \gamma^2[n] \varepsilon^{-1}[n-1] \\
& - 2 \sum_{n=1}^{n_1} \gamma[n] E\{W(p[n-1]) - W(p^*[n-1])\} \\
& + 2N \sum_{n=1}^{n_1} \gamma[n] E\{t[n]\}.
\end{aligned} \tag{6.5.31}$$

As $n_1 \to \infty$ the last sum converges according to the lemma. Therefore

$$\sum_{n=1}^{\infty} \gamma[n+1] E\{r(p[n]) - r(p^*[n])\} < \infty. \tag{6.5.32}$$

By the Fatou lemma

$$\sum_{n=1}^{\infty} \gamma[n+1] \{r(p[n]) - r(p^*[n])\} < \infty \quad \text{a.s.} \tag{6.5.33}$$

From (6.5.33) and (6.5.14) follows the existence of a subsequence $n_k$ such
that

$$\|p[n_k] - p^*[n_k]\|^2 \to 0 \quad \text{a.s. as } n_k \to \infty.$$

Therefore $v - u = 0$ a.s. and $\|p[n] - p^*[n]\| \to 0$ a.s. as $n \to \infty$.

On the other hand, $p^*[n] \to p^*$, and so

$$p[n] \to p^* \quad \text{a.s. as } n \to \infty. \qquad \text{Q.E.D.}$$

Example Search for an optimal policy in a Markov decision process in
the absence of a priori information.

Consider a system of $I$ states, $S_1, \ldots, S_I$. At every stage $n = 1, 2, \ldots$, one
of $M$ possible decisions $D_1, \ldots, D_M$ must be made. Denote by $S[n]$ and
$D[n]$ the state and decision made in stage $n$, respectively. If $S[n] = S_i$ and
$D[n] = D_k$, then the system moves at the next stage, $n + 1$, into the state $S_j$
with an a priori unknown probability

$$\pi_{ij}^k = \Pr\{S[n+1] = S_j \mid S[n] = S_i,\ D[n] = D_k\}.$$

This transition, if it occurs, is followed by a random reward (or penalty) $c_{ij}^k$
with an a priori unknown expectation $\bar c_{ij}^k$. The expected payoff at state $S_i$, after
the decision $D_k$ is made, is given by

$$c_i^k = \sum_{j=1}^{I} \pi_{ij}^k\, \bar c_{ij}^k.$$

A policy is a vector of indices $P = (k_1, \ldots, k_I)$, which determines what
decision should be made at each state: for every $i = 1, \ldots, I$, $k_i$ is an
integer lying between 1 and $M$, and at state $S_i$ decision $D_{k_i}$ should be
made.

Suppose that some fixed policy $P = (k_1, \ldots, k_I)$ is maintained. The
system then constitutes a Markov chain with transition probabilities

$$\Pr\{S[n+1] = S_j \mid S[n] = S_i\} = \pi_{ij}^{k_i}.$$

Henceforth it is assumed that for every policy $P$, the corresponding
Markov chain is ergodic. Denote by $\pi_1^{(P)}, \ldots, \pi_I^{(P)}$ the steady-state
probabilities of this chain, that is,

$$\pi_i^{(P)} = \lim_{n \to \infty} \Pr\{S[n] = S_i\}.$$

The problem is to find a policy $P$ for which the expected payoff,

$$r^{(P)} = \sum_{i=1}^{I} \pi_i^{(P)} c_i^{k_i},$$

is minimal, where $c_i^{k_i}$ is the expected payoff at state $S_i$ under decision $D_{k_i}$.

There are $N = M^I$ possible policies. For each policy $P_m = (k_1^m, \ldots, k_I^m)$,
$m = 1, \ldots, N$, let $r_m = r^{(P_m)}$. The problem is therefore to choose the policy
with the smallest value of $r_m$.
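
To make the objective concrete, the sketch below enumerates all $M^I$ policies of a tiny problem and evaluates the long-run payoff of each from its steady-state probabilities. In the setting of the example these probabilities and payoffs are unknown and must be learned by simulation; here they are taken as known, purely to show what quantity is being minimized. All numbers and function names are hypothetical.

```python
from itertools import product


def stationary_distribution(P, iterations=2000):
    """Steady-state probabilities of an ergodic chain by repeated multiplication."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def policy_payoff(policy, trans, cost):
    """Long-run expected payoff of a fixed policy (k_1, ..., k_I).

    trans[k][i][j] is the transition probability from state i to j under
    decision k, and cost[k][i] the expected one-stage payoff in state i.
    """
    P = [trans[k][i] for i, k in enumerate(policy)]
    pi = stationary_distribution(P)
    return sum(pi[i] * cost[k][i] for i, k in enumerate(policy))


# Hypothetical two-state, two-decision problem (all numbers are made up).
trans = [
    [[0.9, 0.1], [0.5, 0.5]],   # decision 0
    [[0.2, 0.8], [0.6, 0.4]],   # decision 1
]
cost = [[1.0, 4.0], [2.0, 3.0]]
policies = list(product(range(2), repeat=2))      # N = M**I = 4 policies
payoffs = {pol: policy_payoff(pol, trans, cost) for pol in policies}
best = min(payoffs, key=payoffs.get)
```

The number of policies grows as $M^I$, which is why exhaustive evaluation quickly becomes impractical and an adaptive search over policies is attractive.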

The regenerative process $\{X^m(t) : t \ge 0\}$, corresponding to the policy
$P_m$, is the Markov chain whose states are $S_1, \ldots, S_I$ and whose transition
probabilities are $\pi_{ij}^{k_i^m}$, $i, j = 1, \ldots, I$. The regeneration times $T_n^m$, $n =
0, 1, 2, \ldots$, for this policy are the times of visiting a certain fixed state, say
$S_1$.

Since the algorithm (6.5.9)-(6.5.16) does not require any a priori
information about the regenerative processes $\{X^m(t) : t \ge 0\}$, $m = 1, \ldots, N$,
or about the values of $r_1, \ldots, r_N$, it can be applied for finding the optimal
policy for the Markov decision process described above.


6.6 THE REGENERATIVE METHOD FOR CONSTRAINED
OPTIMIZATION PROBLEMS [62]

In this section we consider an algorithm for solving a linear programming
problem whose coefficients represent some unknown characteristics of
regenerative processes.
Let us consider the following linear programming problem:

$$\min_p r_0(p) = \min_p \sum_{i=1}^{N} E\{f_0(X^i)\}\, p_i \tag{6.6.1}$$

subject to

$$r_j(p) = \sum_{i=1}^{N} E\{f_j(X^i)\}\, p_i \le 0, \quad j = 1, \ldots, M, \tag{6.6.2}$$

$$p = (p_1, \ldots, p_N), \quad \sum_{i=1}^{N} p_i = 1, \quad p_i \ge 0. \tag{6.6.3}$$

Here $X^i$, $i = 1, 2, \ldots, N$, are the steady-state r.v.'s of the regenerative
processes $\{X^i(t) : t \ge 0\}$, $i = 1, 2, \ldots, N$; the functions $f_j$, $j = 0, 1, \ldots, M$, are
real measurable bounded functions defined on the ranges of these processes.
$E\{f_j(X^i)\}$ can be viewed as a performance index of the $i$th system,
$i = 1, \ldots, N$.

We assume that the values $E\{f_j(X^i)\}$, $i = 1, \ldots, N$, $j = 0, 1, \ldots, M$, are
unknown a priori; therefore the standard simplex method for solving this
linear programming problem cannot be applied. Our solution for this
problem is based on the penalty function given below and the regenerative
approach studied in the previous sections. Before we start solving this
problem let us note that, if we drop (6.6.2) in the linear programming
(LP) problem (6.6.1)-(6.6.3), then the problem (6.6.1)-(6.6.3) is identical
to the problem (6.5.3)-(6.5.4), which is of course the same as the problem
(6.5.2). The problem (6.5.3)-(6.5.4) is referred to as an unconstrained
problem (UC) and is therefore a particular case of the constrained LP
problem (6.6.1)-(6.6.3).

We start solving the problem (6.6.1)-(6.6.3) by introducing the following
penalty function:

$$L(p, \mu) = \sum_{i=1}^{N} E\{f_0(X^i)\}\, p_i + \sum_{j=1}^{M} \frac{\mu_j}{2} \Big( \Big[ \sum_{i=1}^{N} E\{f_j(X^i)\}\, p_i \Big]^+ \Big)^2, \tag{6.6.4}$$

where $\mu_j > 0$, $j = 1, \ldots, M$. The operator $[\,\cdot\,]^+$ is defined by

$$[Z]^+ = \begin{cases} Z, & \text{for } Z > 0 \\ 0, & \text{for } Z \le 0. \end{cases} \tag{6.6.5}$$

Now instead of the original LP problem, the following problem is solved:

$$\inf_p \lim_{n \to \infty} L(p, \mu[n]), \tag{6.6.6}$$

where $p$ satisfies (6.6.3) and the sequences $\{\mu_j[n]\}_{n=1}^{\infty}$, $j = 1, \ldots, M$, satisfy
the following conditions:

$$\mu_j[n] > 0, \quad \mu'[n] \le \mu_j[n] \le \mu''[n], \quad \mu'[n] \uparrow \infty, \quad \mu[n] = (\mu_1[n], \ldots, \mu_M[n]). \tag{6.6.7}$$
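
The role of the penalty term can be seen in a small numerical sketch. It assumes a quadratic penalty of the form $L(p, \mu) = r_0(p) + \sum_j (\mu_j/2)([r_j(p)]^+)^2$, which is one common reading of (6.6.4); the coefficients, which in the text are unknown expectations, are replaced here by made-up constants, and the function name is hypothetical.

```python
def penalty(p, r0, r, mu):
    """Quadratic penalty L(p, mu) = r_0(p) + sum_j (mu_j / 2) * max(r_j(p), 0)**2.

    r0[i] stands for E{f_0(X^i)} and r[j][i] for E{f_j(X^i)}, j = 1..M; in the
    setting of the text these are unknown and would be replaced by estimates.
    """
    objective = sum(c * pi for c, pi in zip(r0, p))
    total = objective
    for mu_j, row in zip(mu, r):
        violation = max(sum(c * pi for c, pi in zip(row, p)), 0.0)   # [r_j(p)]^+
        total += 0.5 * mu_j * violation ** 2
    return total


# Hypothetical coefficients for N = 2 systems and M = 1 constraint.
r0 = [1.0, 3.0]
r = [[2.0, -1.0]]                      # constraint: 2*p_1 - p_2 <= 0
feasible = penalty([0.25, 0.75], r0, r, mu=[10.0])
infeasible = penalty([1.0, 0.0], r0, r, mu=[10.0])
```

A feasible point pays only the objective (2.5 here), while the infeasible vertex, whose objective value of 1 is lower, is charged 21 once the violation is penalized. Letting the weights grow without bound makes any violation arbitrarily expensive.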


Now we propose an adaptive algorithm that converges with probability
one to the optimal solution of the LP problem (6.6.1)-(6.6.3).

The algorithm is similar to the algorithm (6.5.8)-(6.5.16) and is based on
a step-by-step correction of the probability vector $p[n]$, where $n$ denotes
the step number. As in the algorithm (6.5.8)-(6.5.16) there exists a mechanism,
provided by (6.6.12) below, that ensures that $p_i[n] \ge \varepsilon[n]$, $i = 1, \ldots, N$,
where $\{\varepsilon[n]\}_{n=0}^{\infty}$ is a monotone decreasing sequence of positive numbers,
subject to (6.6.17) through (6.6.22) below. On the $n$th step the $i$th system,
$i \in \{1, \ldots, N\}$, is chosen by simulating the distribution $p[n-1]$. We denote
this event by $X[n] = X^i$. One cycle of the process $\{X^i(t) : t \ge 0\}$ is carried
out. Denote by $\nu_i[n]$, $i = 1, \ldots, N$, the total number of cycles made by the
$i$th system up to and including the $n$th step. We check whether or not the
inequality $\nu_k[n-1] \ge n\varepsilon[n]$, $k \in \{1, \ldots, i-1, i+1, \ldots, N\}$, is satisfied for
all systems. If for some indices $k_1, \ldots, k_s \in \{1, \ldots, i-1, i+1, \ldots, N\}$ this
inequality does not hold, then one additional cycle is carried out for each
such system so that ultimately

$$\nu_k[n] \ge n\varepsilon[n], \quad k = 1, \ldots, N, \tag{6.6.8}$$

holds.

We record also

$$\tau_n^k = T^k_{\nu_k[n]} - T^k_{\nu_k[n]-1}, \quad k = i, k_1, \ldots, k_s, \tag{6.6.9}$$

the lengths of the cycles performed, and for each $k$ calculate $M + 1$
numbers

$$Y_n^{kj} = \int_{T^k_{\nu_k[n]-1}}^{T^k_{\nu_k[n]}} f_j(X^k(t))\,dt, \quad k = i, k_1, \ldots, k_s, \quad j = 0, 1, \ldots, M, \tag{6.6.10}$$

if the process $\{X^k(t) : t \ge 0\}$ is continuous-time.

In the case of discrete-parameter processes the integral should be replaced
by the corresponding sum over the $\nu_k[n]$th cycle. Set also

$$Y_n^{kj} = \tau_n^k = 0, \quad \text{if } k \ne i, k_1, \ldots, k_s, \quad j = 0, 1, \ldots, M. \tag{6.6.11}$$


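The new distribution $p[n]$ is updated according to the following recurrence
formula:

$$p[n] = \pi_{\varepsilon[n]}\{p[n-1] - \gamma[n] B(n \mid i)\}. \tag{6.6.12}$$

Here $S_\varepsilon$ is a simplex in $R^N$:

$$S_\varepsilon = \Big\{ p = (p_1, \ldots, p_N) : \sum_{k=1}^{N} p_k = 1,\ 0 \le \varepsilon \le p_k \le 1 \Big\},$$

$\pi_\varepsilon$ is the projection operator onto the simplex $S_\varepsilon$, such that for any
$Z \in R^N$,

$$\|Z - \pi_\varepsilon(Z)\| = \min_{y \in S_\varepsilon} \|Z - y\|,$$

and $B(\cdot \mid \cdot)$ is a vector $(B_1(\cdot \mid \cdot), \ldots, B_N(\cdot \mid \cdot))$, where

$$B_i(n \mid i) = \frac{g(n \mid i)}{p_i[n-1]}, \qquad B_k(n \mid i) = -\frac{g(n \mid i)}{(N-1)\, p_i[n-1]}, \quad k \ne i, \tag{6.6.13}$$

$$g(n \mid i) = r_{i0}[n] + \sum_{j=1}^{M} \mu_j[n] \Big[ \sum_{k=1}^{N} r_{kj}[n]\, p_k[n-1] \Big]^+ r_{ij}[n], \tag{6.6.14}$$

$$r_{kj}[n] = \frac{Y_{kj}[n]}{\tau_k[n]}, \tag{6.6.15}$$

$$Y_{kj}[n] = Y_{kj}[n-1] + Y_n^{kj}, \quad \tau_k[n] = \tau_k[n-1] + \tau_n^k, \quad k = 1, \ldots, N, \quad j = 0, 1, \ldots, M. \tag{6.6.16}$$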

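The initial values of $p[0] \in S_{\varepsilon[0]}$, $Y[0]$, and $\tau[0]$ can be chosen
arbitrarily. In the above, the sequences

$$\{\gamma[n]\}_{n=1}^{\infty}, \quad \{\varepsilon[n]\}_{n=0}^{\infty}, \quad \{\mu'[n]\}_{n=1}^{\infty}, \quad \text{and} \quad \{\mu''[n]\}_{n=1}^{\infty}$$

must be chosen in such a way that the following conditions are satisfied:

$$\gamma[n] \downarrow 0, \quad \varepsilon[n] \downarrow 0 \tag{6.6.17}$$

$$\sum_{n=1}^{\infty} \gamma[n] = \infty \tag{6.6.18}$$

$$\sum_{n=1}^{\infty} (\gamma[n] \mu''[n])^2 \varepsilon^{-1}[n-1] < \infty \tag{6.6.19}$$

$$\sum_{n=1}^{\infty} n^{-1/2} \gamma[n] \mu''[n] \varepsilon^{-1}[n] < \infty \tag{6.6.20}$$

$$\sum_{n=1}^{\infty} \gamma[n] (\mu'[n])^{-1} < \infty \tag{6.6.21}$$

$$\sum_{n=1}^{\infty} \gamma[n] \varepsilon[n] < \infty. \tag{6.6.22}$$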
Remark 1 In order to satisfy conditions (6.6.17) through (6.6.22) we can
take, for example,


y[n]~~n ° 2 , 




Remark 2 We assume that $\tau_n^i \ge \tau_0 > 0$, $i = 1, \ldots, N$, $n = 1, 2, \ldots$, that is, a
cycle will be taken into account if it is of some minimal length (which can
be considered as the sensitivity threshold of the measuring instrument).

Remark 3 The r.v.'s $Y_{kj}[n]$ and $\tau_k[n]$, $k = 1, \ldots, N$, $j = 0, 1, \ldots, M$, $n \ge 1$,
defined in (6.6.16), accumulate the information obtained up to and including
the $n$th step. It is worth noting that, for each fixed $k$, only $\nu_k[n]$
summands in both $Y_{kj}[n]$ and $\tau_k[n]$ are nonzero.

Now we formulate a theorem, which is proven in Rubinstein and
Karnovsky [62].

Theorem 6.6.1 If the values of the functions $f_j$, $j = 0, 1, \ldots, M$, are uniformly
bounded by some constant $D$ and if there exists the unique optimal
solution $p^*$ of the LP problem, then for any initial distribution $p[0] \in S_{\varepsilon[0]}$
the sequence $\{p[n]\}_{n=1}^{\infty}$ generated by the algorithm (6.6.7)-(6.6.22) converges
with probability 1 to $p^*$.

Corollary 1 Since the UC problem (6.5.3)-(6.5.4) is a special case of the
LP problem (6.6.1)-(6.6.3), the algorithm (6.6.7)-(6.6.22) solves the UC
problem as well.

Corollary 2 The theorem remains valid if we assume that the values of the
functions $f_j$ cannot be observed directly, but can be measured with a
random noise. In other words,

$$f_j(X^i) = E_\xi\{Q_j(X^i, \xi)\}, \quad i = 1, \ldots, N, \quad j = 0, 1, \ldots, M,$$

where $\xi$ is a random vector with an unknown time-independent probability
distribution function. In this case we can consider another random process,
$\{U^i(t) : t \ge 0\} = \{(X^i(t), \xi)\}$, $i = 1, \ldots, N$, for which

$$E\{f_j(X^i)\} = E\{E_\xi\{Q_j(X^i, \xi)\}\} = E\{Q_j(U^i)\}.$$
6.7 VARIANCE REDUCTION TECHNIQUES

In Chapter 4 we studied several variance reduction techniques, namely
correlated and stratified sampling and antithetic and control variates, for
estimating integrals, the mean waiting time in the GI/G/1 queueing
system, and the expected completion time in networks. Here we deal
further with variance reduction techniques for estimating some output
parameters of the steady-state distribution of regenerative processes. To
understand how expensive simulations can be, consider estimating, via
simulation, $E[W]$, the expected stationary waiting time in an M/M/1
queue. Usually, we would not simulate an M/M/1 queue since analytic
results are available. However, despite its simplicity it can be very expensive
to estimate $E[W]$. It is therefore a good candidate for testing simulation
methodologies. Let the traffic intensity $\rho < 1$; then $\bar W_N$, the average of
the first $N$ waiting times, has an asymptotically normal distribution with
mean $E[W]$ and variance $\sigma^2/N$, that is, $\sqrt{N}(\bar W_N - E[W])/\sigma \Rightarrow N(0, 1)$ as
$N \to \infty$. Therefore a confidence interval for $E[W]$ may be constructed.

A major problem in any simulation is how long to run it. One possibility
is to run the simulation until the half length of a confidence interval of
prescribed level falls below a given value. Table 6.7.1 lists the run
lengths needed for a 90% confidence interval for the M/M/1 queue to
have a half length of 0.10 E(W). It follows from this table that as $\rho$
increases beyond 0.3 the required run lengths increase rapidly, and for
large values of $\rho$ simulation is no longer a practical method.


Table 6.7.1 Sample Sizes Required for the M/M/1 Queue

   ρ       E(W)     σ²             N
  0.10     0.111    0.375          8,200
  0.20     0.250    1.39           6,020
  0.30     0.429    3.96           5,830
  0.40     0.667    10.6           6,430
  0.50     1.00     29.0           7,850
  0.60     1.50     88.5           10,600
  0.70     2.33     335            16,700
  0.80     4.00     1,976          33,400
  0.90     9.00     35,901         119,000
  0.95     19.0     607,600        455,000
  0.99     99.0     3.95 × 10^8    1.09 × 10^7

Source: Ref. 24.

Note: N = number of customers that must be simulated for a
90% confidence interval for E[W] to have a half length of 0.1 E[W].
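
The entries of Table 6.7.1 can be checked against the c.l.t. A 90% confidence interval for $E[W]$ has half length $z\sigma/\sqrt{N}$ with $z \approx 1.645$, so requiring a half length of $0.1\,E[W]$ gives $N = z^2\sigma^2/(0.1\,E[W])^2$. The following Python sketch recomputes a few rows under that assumption; the function name `required_run_length` is illustrative and does not come from Ref. 24.

```python
def required_run_length(sigma2, mean_wait, rel_half_length=0.10, z=1.645):
    """Customers N needed so that z * sqrt(sigma2 / N) = rel_half_length * mean_wait.

    z = 1.645 is the 0.95 quantile of the standard normal distribution,
    which corresponds to a two-sided 90% confidence interval.
    """
    half_length = rel_half_length * mean_wait
    return z * z * sigma2 / (half_length * half_length)


# (rho, E(W), sigma^2) as listed in Table 6.7.1
rows = [(0.10, 0.111, 0.375), (0.90, 9.00, 35901.0), (0.99, 99.0, 3.95e8)]
for rho, mean_wait, sigma2 in rows:
    n = required_run_length(sigma2, mean_wait)
    print(f"rho = {rho:.2f}: N is about {n:,.0f}")
```

The computed values agree with the tabulated run lengths to within rounding, which illustrates how quickly the cost grows as the traffic intensity approaches 1.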

In the following two sections we consider control variates and common 
random numbers (correlated sampling) techniques for variance reduction 
while simulating stochastic processes, and we give some practical recom- 
mendations for their application. The results of these sections are based on 
Heidelberger [24], Heidelberger and Iglehart [23], and Lavenberg, Moeller, 
and Sauer [45], and are reproduced mostly from them. 


6.7.1 Control Variates

The method of control variates has already been described in Sections
4.3.3 and 4.3.12, and is only reviewed briefly here.

Let $\{X_n, n > 0\}$ be a sequence of i.i.d. random variables with unknown
mean $r = E(X_n)$. We are interested in estimating $r$ via simulation. Let
$\sigma_x^2 = \sigma^2(X_n)$ be the variance of $X_n$. We can estimate $r$ by

$$\bar{X}_N = \frac{\sum_{n=1}^{N} X_n}{N}$$

and then form a confidence interval by using the c.l.t.:

$$\frac{\sqrt{N}\,(\bar{X}_N - r)}{\sigma_x} \Rightarrow N(0, 1) \quad \text{as } N \to \infty.$$

Suppose now that we have another sequence of random variables
$\{C_n, n > 0\}$, called control variates, such that the $C_n$'s are i.i.d.,
that $X_n$ and $C_n$ are correlated (usually achieved by simulating $X_n$
and $C_n$ with the same stream of random numbers), and that $r_c = E(C_n)$
is known. Let $\beta$ be some constant and set

$$Z_n(\beta) = X_n - \beta (C_n - r_c). \tag{6.7.1}$$

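Then $\{Z_n(\beta), n > 0\}$ are i.i.d. with mean $r$ and some variance
denoted by $\sigma^2(\beta)$. Let

$$\bar{Z}_N(\beta) = \frac{\sum_{n=1}^{N} Z_n(\beta)}{N};$$

then by the strong law of large numbers $\bar{Z}_N(\beta) \to r$ a.s. as
$N \to \infty$ and, by the c.l.t.,

$$\frac{\sqrt{N}\,(\bar{Z}_N(\beta) - r)}{\sigma(\beta)} \Rightarrow N(0, 1) \quad \text{as } N \to \infty. \tag{6.7.2}$$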


It can be readily shown (see also Section 4.4.3) that $\beta^*$, which
minimizes the variance $\sigma^2(\beta)$, is equal to

$$\beta^* = \frac{\operatorname{cov}(X_n, C_n)}{\sigma^2(C_n)}$$

and that

$$\sigma^2(\beta^*) = (1 - \rho^2(X_n, C_n))\,\sigma_x^2, \tag{6.7.3}$$

where $\rho(X_n, C_n)$ is the correlation coefficient of $X_n$ and $C_n$.
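
As a small numerical illustration of (6.7.1)-(6.7.3), the Python sketch below estimates $r = E[e^U]$, $U$ uniform on $(0, 1)$, using $C = U$ with known mean $r_c = 1/2$ as the control variate. The test problem, the function names, and the use of a sample estimate in place of the unknown $\beta^*$ are assumptions made for illustration only; replacing $\beta^*$ by its estimate introduces a small bias that vanishes as $N$ grows.

```python
import math
import random


def sample_variance(values):
    """Unbiased sample variance."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def control_variate_estimate(xs, cs, r_c):
    """Return (plain mean, control-variate mean, estimated beta*)."""
    n = len(xs)
    x_bar = sum(xs) / n
    c_bar = sum(cs) / n
    cov_xc = sum((x - x_bar) * (c - c_bar) for x, c in zip(xs, cs)) / (n - 1)
    beta = cov_xc / sample_variance(cs)      # sample version of cov(X, C) / var(C)
    z_bar = x_bar - beta * (c_bar - r_c)     # mean of Z_n(beta) from (6.7.1)
    return x_bar, z_bar, beta


# Illustrative problem (an assumption): r = E[exp(U)] = e - 1.
rng = random.Random(12345)
us = [rng.random() for _ in range(10_000)]
xs = [math.exp(u) for u in us]
x_bar, z_bar, beta = control_variate_estimate(xs, us, r_c=0.5)

var_x = sample_variance(xs)
var_z = sample_variance([x - beta * (u - 0.5) for x, u in zip(xs, us)])
print(f"plain estimate   {x_bar:.5f}")
print(f"control variate  {z_bar:.5f}   (exact value {math.e - 1:.5f})")
print(f"variance ratio   {var_z / var_x:.4f}")
```

The printed variance ratio estimates $1 - \rho^2(X_n, C_n)$ from (6.7.3); for this strongly correlated pair it is far below 1.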

Formula (6.7.1) can be extended easily to the case of more than one
control variate. Indeed, let $C = (C_1, \ldots, C_Q)$ be a vector of $Q$
control variates, let $r_c = (r_1, \ldots, r_Q)$ be the known mean vector
corresponding to $C$, and let $\beta$ be any vector. Then

$$Z_n(\beta) = X_n - \beta^t (C - r_c) \tag{6.7.4}$$

is an unbiased estimator of $r$.

Another type of control variate $C = (C_1, \ldots, C_Q)$ is one for which the
vector $E(C)$ is unknown but its components $E(C_q)$, $q = 1, \ldots, Q$, are
equal to $r$. In this case

$$\beta_0 X_n + \sum_{q=1}^{Q} \beta_q C_q, \tag{6.7.5}$$

with $\sum_{q=0}^{Q} \beta_q = 1$, is again an unbiased estimator of $r$.

We now consider two examples of the application of control variates for
which formulas (6.7.4) and (6.7.5) are applied and variance reduction is
achieved. The first example deals with (6.7.5); the second with (6.7.4).

Example 1 Let $\{X_n, n \ge 0\}$ be an irreducible, aperiodic, positive
recurrent Markov chain with state space $I = \{0, 1, 2, \ldots\}$ and
transition matrix $P = \{p_{ij} : i, j \in I\}$. It is known from Section 6.2
that $X_n \Rightarrow X$ as $n \to \infty$ ($\Rightarrow$ denotes weak
convergence), where $X$ is the steady-state random variable having the
stationary distribution $\pi = \{\pi_i : i \in I\}$, and $\pi$ can be found
from the solution of the system of linear equations $\pi = \pi P$.

Let $f : I \to R$ be a real-valued function on $I$ and define

$$r = E\{f(X)\} = \pi f = \sum_{i \in I} \pi_i f(i).$$

Here $\sum_{i \in I} \pi_i f(i)$ is the inner product* of $\pi$ and $f$. We
are interested in estimating $r$. If the matrix $P$ is unknown or the state
space $I$ is large (i.e., it is difficult to solve $\pi = \pi P$), it may
become necessary to estimate $r$ via simulation. This can be done as follows
(see also (6.2.1) through (6.2.4)):

*For simplicity we use this form rather than the more conventional
$\langle \pi, f \rangle$.

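Pick some state in $I$, say 0, and set $T_0 = 0$. Define

$$T_m = \inf\{n > T_{m-1} : X_n = 0\}, \quad m > 0.$$

We say that a regeneration occurs at time $T_m$, and the time between $T_m$
and $T_{m+1}$, that is, $\tau_m = T_{m+1} - T_m$, is referred to as the length
of the $m$th cycle. Let $k$ be some positive integer, let $f_0, f_1, \ldots,
f_k$ be real-valued functions on $I$, and let $r_\nu = \pi f_\nu =
E\{f_\nu(X)\}$, $\nu = 0, 1, \ldots, k$. For each $m \ge 0$ and
$\nu = 0, 1, \ldots, k$, define $Y_m(\nu)$ by

$$Y_m(\nu) = \sum_{n=T_m}^{T_{m+1}-1} f_\nu(X_n).$$

It follows from Proposition 6.2.2 that, if $\pi|f_\nu| < \infty$, then

$$r_\nu = \pi f_\nu = \frac{E(Y_m(\nu))}{E(\tau_m)}. \tag{6.7.6}$$

Let $Z_m(\nu) = Y_m(\nu) - r_\nu \tau_m$ for each $m \ge 0$. By (6.7.6) we
have for each $\nu$

$$E(Z_m(\nu)) = 0. \tag{6.7.7}$$

Define

$$\hat{r}_\nu(M) = \frac{\sum_{m=1}^{M} Y_m(\nu)}{\sum_{m=1}^{M} \tau_m} \tag{6.7.8}$$

and

$$\bar{X}_\nu(N) = \frac{\sum_{n=0}^{N} f_\nu(X_n)}{N + 1} \tag{6.7.9}$$

for each $\nu = 0, 1, \ldots, k$.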

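Then $\hat{r}_\nu(M) \to r_\nu$ a.s. as $M \to \infty$ and
$\bar{X}_\nu(N) \to r_\nu$ a.s. as $N \to \infty$. Observe that
$\hat{r}_\nu(M)$ is an estimator for $r_\nu$ based on $M$ cycles of the
process and $\bar{X}_\nu(N)$ is an estimator for $r_\nu$ based on $N$
transitions of the process. Because $\{Z_m(\nu) : m \ge 0\}$ are i.i.d., it
is readily possible to prove the following two c.l.t.'s:

$$\frac{\sqrt{M}\,(\hat{r}_\nu(M) - r_\nu)}{\sigma_\nu / E(\tau_1)} \Rightarrow N(0, 1) \quad \text{as } M \to \infty, \tag{6.7.10}$$

$$\frac{\sqrt{N}\,(\bar{X}_\nu(N) - r_\nu)}{\sigma_\nu / E(\tau_1)^{1/2}} \Rightarrow N(0, 1) \quad \text{as } N \to \infty, \tag{6.7.11}$$

where $\sigma_\nu^2 = E\{Z_m^2(\nu)\}$.

Proposition 6.7.1 Let $\Sigma_k$ be the $(k + 1) \times (k + 1)$-dimensional
covariance matrix of the $Z_m(\nu)$'s, whose $(i, j)$th entry is
$\sigma_{ij} = E[Z_m(i) Z_m(j)]$. If $E(|f_\nu(X)|) < \infty$ for each
$\nu = 0, 1, \ldots, k$, then

$$[\sqrt{M}\,(\hat{r}_0(M) - r_0), \ldots, \sqrt{M}\,(\hat{r}_k(M) - r_k)] \Rightarrow N(0, \Sigma_k / E(\tau_1)^2) \tag{6.7.12}$$

$$[\sqrt{N}\,(\bar{X}_0(N) - r_0), \ldots, \sqrt{N}\,(\bar{X}_k(N) - r_k)] \Rightarrow N(0, \Sigma_k / E(\tau_1)). \tag{6.7.13}$$

The proof of this proposition is given in Ref. 24.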


Now let $\beta$ be a $(k + 1)$-dimensional row vector of real numbers whose
$\nu$th entry is $\beta(\nu)$. Let $r$, $\hat{r}(M)$, and $\bar{X}(N)$ denote
$(k + 1)$-dimensional column vectors whose $\nu$th entries are $r_\nu$,
$\hat{r}_\nu(M)$, and $\bar{X}_\nu(N)$, respectively. A simple application of
the continuous mapping theorem (Theorem 5.1 of Billingsley [1]) yields the
following.


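Proposition 6.7.2 Let $\sigma_k^2(\beta) = \beta \Sigma_k \beta^t =
\sum_{i=0}^{k} \sum_{j=0}^{k} \beta(i) \sigma_{ij} \beta(j)$. Under the
hypotheses of Proposition 6.7.1,

$$\frac{\sqrt{M}\,(\beta \hat{r}(M) - \beta r)}{\sigma_k(\beta) / E(\tau_1)} \Rightarrow N(0, 1) \quad \text{as } M \to \infty, \tag{6.7.14}$$

and

$$\frac{\sqrt{N}\,(\beta \bar{X}(N) - \beta r)}{\sigma_k(\beta) / E(\tau_1)^{1/2}} \Rightarrow N(0, 1) \quad \text{as } N \to \infty, \tag{6.7.15}$$

where $\beta r = \sum_{\nu=0}^{k} \beta(\nu) r_\nu$ is the inner product of
$\beta$ and $r$.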


In order to form confidence intervals for the $r_\nu$'s (or for linear
combinations of the $r_\nu$'s) it is necessary to know the $\sigma_{ij}$'s as
well as $E(\tau_1)$. These constants are usually unknown and must be
estimated. In addition $\beta$ may be a fixed, but unknown, vector so it too
must be estimated. The following proposition, the proof of which is also
given in Ref. 24, tells us that we may replace these quantities in
Proposition 6.7.2 by any sequence of strongly consistent estimators while
preserving the asymptotic normality.


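Proposition 6.7.3 Suppose that $\hat{\tau}_1(M) \to E(\tau_1)$ a.s., that
$\hat{\sigma}_{ij}(M) \to \sigma_{ij}$ a.s. for each $i$ and $j$, and that
$\hat{\beta}(\nu, M) \to \beta(\nu)$ a.s. for each $\nu$. Let
$\hat{\Sigma}_k(M)$ be the matrix whose $(i, j)$th entry is
$\hat{\sigma}_{ij}(M)$, let $\hat{\beta}(M)$ be the vector whose $\nu$th
component is $\hat{\beta}(\nu, M)$, and let $\hat{\sigma}_k^2(\hat{\beta}, M)
= \hat{\beta}(M) \hat{\Sigma}_k(M) \hat{\beta}^t(M)$. Then

$$\frac{\sqrt{M}\,(\hat{\beta}(M) \hat{r}(M) - \hat{\beta}(M) r)}{\hat{\sigma}_k(\hat{\beta}, M) / \hat{\tau}_1(M)} \Rightarrow N(0, 1) \quad \text{as } M \to \infty. \tag{6.7.16}$$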

We turn now to the problem of choosing the functions $f_\nu$ with a view to
achieving variance reductions.

Heidelberger [24-26] suggested several ways of choosing $f_\nu$,
$\nu = 0, \ldots, k$. We consider only one of them [24].

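Let

$$f_\nu = P^\nu f, \quad \nu = 0, 1, \ldots, k, \tag{6.7.17}$$

where $P^\nu$ is the $\nu$-step transition matrix of the process. It is shown
in Ref. 24 that in this case, that is, when $f_\nu = P^\nu f$, all
$r_\nu = \pi f_\nu$, $\nu = 0, 1, \ldots, k$, are equal to $r$, and if
$E\{|f(X)|\} < \infty$, then $\pi f = \pi(Pf)$. Since $r_\nu = r$,
$\nu = 0, 1, \ldots, k$, it is obvious that

$$\frac{\sum_{m=1}^{M} Y_m(\nu)}{\sum_{m=1}^{M} \tau_m} \to \frac{E(Y_1(\nu))}{E(\tau_1)} = \pi f_\nu = r \quad \text{a.s. as } M \to \infty. \tag{6.7.18}$$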


Therefore each $\hat{r}_\nu(M)$, $\nu = 0, 1, \ldots, k$, is a strongly
consistent estimator for $r$, and we can use any one of them for this
purpose. However, better results can be achieved by using all of them
simultaneously, for instance, using (6.7.5), which can be written as

$$\hat{r}_\beta(M) = \sum_{\nu=0}^{k} \beta(\nu)\, \hat{r}_\nu(M), \tag{6.7.19}$$

where $\sum_{\nu=0}^{k} \beta(\nu) = 1$. Variance reduction can be achieved if
we choose the $\beta(\nu)$'s so as to minimize the asymptotic variance
$\sigma_k^2(\beta)$ of $\hat{r}_\beta(M)$. Mathematically it can be written as

$$\text{minimize } \sigma_k^2(\beta) = \beta \Sigma_k \beta^t \tag{6.7.20}$$

$$\text{subject to } \sum_{\nu=0}^{k} \beta(\nu) = 1. \tag{6.7.21}$$


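The solution of this problem, which can be obtained using Lagrange
multipliers, is

$$\beta^* = \frac{e \Sigma_k^{-1}}{e \Sigma_k^{-1} e^t}, \tag{6.7.22}$$

$$\sigma_k^2(\beta^*) = \frac{1}{e \Sigma_k^{-1} e^t}, \tag{6.7.23}$$

where $e$ denotes the $(k + 1)$-dimensional row vector each of whose
components is 1, and where $t$ is the transpose operation. Formulas (6.7.10)
and (6.7.11) can now be rewritten as

$$\frac{\sqrt{M}\,(\hat{r}_{\beta^*}(M) - r)}{\sigma_k(\beta^*) / E(\tau_1)} \Rightarrow N(0, 1) \tag{6.7.24}$$

$$\frac{\sqrt{N}\,(\bar{X}_{\beta^*}(N) - r)}{\sigma_k(\beta^*) / E(\tau_1)^{1/2}} \Rightarrow N(0, 1), \tag{6.7.25}$$

where $\bar{X}_{\beta^*}(N) = \sum_{\nu=0}^{k} \beta^*(\nu) \bar{X}_\nu(N)$,
and both

$$\hat{r}_{\beta^*}(M) \to r \quad \text{a.s. as } M \to \infty$$

and

$$\bar{X}_{\beta^*}(N) \to r \quad \text{a.s. as } N \to \infty.$$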


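Since the covariance matrix $\Sigma_k$ is in general unknown, it is necessary
to estimate it. If $\hat{\Sigma}_k(M)$ is any estimator such that
$\hat{\Sigma}_k(M) \to \Sigma_k$ a.s. as $M \to \infty$, then it is clear that
$\hat{\Sigma}_k^{-1}(M) \to \Sigma_k^{-1}$ a.s. as $M \to \infty$. Letting

$$\hat{\beta}^*(M) = \frac{e \hat{\Sigma}_k^{-1}(M)}{e \hat{\Sigma}_k^{-1}(M) e^t}, \qquad \hat{r}_{\hat{\beta}^*}(M) = \sum_{\nu=0}^{k} \hat{\beta}^*(\nu, M)\, \hat{r}_\nu(M) \tag{6.7.26}$$

and applying Proposition 6.7.3, we have

$$\frac{\sqrt{M}\,(\hat{r}_{\hat{\beta}^*}(M) - r)}{\hat{\sigma}_k(\hat{\beta}^*, M) / \hat{\tau}_1(M)} \Rightarrow N(0, 1), \tag{6.7.27}$$

where $\hat{\sigma}_k(\hat{\beta}^*, M) \to \sigma_k(\beta^*)$ a.s. as
$M \to \infty$ and $\hat{\tau}_1(M)$ is any sequence of numbers such that
$\hat{\tau}_1(M) \to E(\tau_1)$ a.s. as $M \to \infty$. A corresponding c.l.t.
exists for the $\bar{X}_\nu(N)$'s as well.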

This method is called the "method of multiple estimates" because it
combines several different estimates of the same quantity.
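
The weights (6.7.22) and the minimal variance (6.7.23) only require the vector $\Sigma_k^{-1} e^t$, which can be obtained by solving one linear system. The Python sketch below does this for a small covariance matrix; the matrix entries and the helper names `solve_linear_system` and `optimal_weights` are hypothetical and serve only to illustrate the formulas.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                      # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def optimal_weights(sigma):
    """Return (beta*, sigma_k^2(beta*)) as in (6.7.22)-(6.7.23).

    sigma is a symmetric positive definite covariance matrix, so
    e Sigma^{-1} is the transpose of Sigma^{-1} e^t.
    """
    ones = [1.0] * len(sigma)
    y = solve_linear_system(sigma, ones)    # y = Sigma^{-1} e^t
    total = sum(y)                          # e Sigma^{-1} e^t
    return [v / total for v in y], 1.0 / total


# Hypothetical 2 x 2 covariance matrix (k = 1), for illustration only.
sigma_k = [[4.0, 2.0],
           [2.0, 3.0]]
beta_star, min_variance = optimal_weights(sigma_k)
print("beta*            =", beta_star)
print("sigma_k^2(beta*) =", min_variance)
```

For this matrix the weights are $(1/3, 2/3)$ and the minimal variance is $8/3$, which is smaller than either diagonal entry, as the optimization (6.7.20)-(6.7.21) guarantees.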

In order to apply this method the functions $f_\nu$ must be computed
(usually before the start of the simulation). For computational efficiency
$f_\nu$ can be defined recursively by $f_0 = f$ and $f_\nu = P f_{\nu-1}$ for
$\nu \ge 1$. This saves having to compute the $\nu$-step transition function
$P^\nu$, a potentially large computational economy. If the state space is
finite and the transition matrix is sparse, the work involved in calculating
$f_\nu$ for a few values of $\nu$ may not be too heavy.
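
The recursion $f_0 = f$, $f_\nu = P f_{\nu-1}$ can be sketched in a few lines of Python. The three-state transition matrix, the function $f$, and the helper names below are assumptions chosen for illustration; the stationary distribution is approximated by repeated multiplication, which is adequate for this small aperiodic chain.

```python
def mat_vec(p, f):
    """Return the vector P f, whose ith entry is sum_j p_ij f(j)."""
    return [sum(p_ij * f_j for p_ij, f_j in zip(row, f)) for row in p]


def multiple_estimate_functions(p, f, k):
    """Return [f_0, ..., f_k] with f_0 = f and f_v = P f_{v-1}."""
    fs = [list(f)]
    for _ in range(k):
        fs.append(mat_vec(p, fs[-1]))
    return fs


def stationary_distribution(p, iterations=500):
    """Approximate pi = pi P by iterating pi <- pi P from the uniform vector."""
    n = len(p)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * p[i][j] for i in range(n)) for j in range(n)]
    return pi


# Hypothetical birth-death chain on {0, 1, 2} and f(i) = i.
P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]
f = [0.0, 1.0, 2.0]

fs = multiple_estimate_functions(P, f, k=3)
pi = stationary_distribution(P)
means = [sum(p_i * f_i for p_i, f_i in zip(pi, fv)) for fv in fs]
print("f_1 =", fs[1])
print("pi f_v for v = 0..3:", means)
```

All the values $\pi f_\nu$ printed at the end coincide with $r = \pi f$, which is the property that makes each $\hat{r}_\nu(M)$ an estimator of the same quantity.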

We note that to form the estimates $\bar{X}_\nu(N)$ (or $\hat{r}_\nu(M)$) we
must evaluate $f_\nu(X_n)$ for each value of $\nu$ and each transition $n$.
This tends to increase the amount of time needed for each transition
simulated. However, if the variance reduction obtained is sufficiently
large, the potential savings in the number of transitions that need to be
simulated will more than offset the extra work per transition. We also note
that additional work must be done at the end of each cycle to update the
estimates of the covariance matrix $\Sigma_k$ (using no variance reducing
technique, we need only update $\sigma_0^2$). It is shown (see [24]) that
$\sigma_k^2(\beta^*) \to 0$ as $k \to \infty$. For many types of Markov
chains we can expect substantial variance reductions even when $k$ is
relatively small (say 2 or 3). For countable $I$ we have

$$f_k(i) = \sum_{j=0}^{\infty} p_{ij}^{(k)} f(j) = E[f(X_{n+k}) \mid X_n = i]. \tag{6.7.28}$$


Thus if the Markov chain makes transitions only to "neighboring" states
and if $f(j)$ is close to $f(i)$ for $j$ close to $i$, it can be seen from
(6.7.28) that, for small $k$, $f_k(i)$ and $f(i)$ should be nearly the same.
This means that $\bar{X}_k(N)$ and $\bar{X}_0(N)$ will be highly correlated, a
condition that generally results in good variance reduction. Many queueing
networks exhibit this special type of structure.

Ideally, we would like to be able to have the "optimal" value of $k$ in the
sense that, for a given computer budget, we would like to pick the value
that yields the narrowest confidence intervals for $r$ (part of the budget
must be allocated to calculation of the $f_\nu$'s). To perform such an
optimization we would have to know $\sigma_\nu^2(\beta^*)$ for each
$\nu > 0$. These quantities are generally unknown, and even to estimate them
would require calculating the $f_\nu$'s and then simulating the Markov
process for an additional number of cycles. The disadvantage of such a
procedure is that the cost of computation of $f_\nu$ may be higher than the
gain achieved through variance reduction. Generally speaking, the success of
this technique depends on our ability to compute and store efficiently the
functions $f_\nu$.

The method of multiple estimates can be extended to certain types of 
continuous-time processes such as continuous-time Markov chains and 
semi-Markov processes (see [24]). 

To assess the efficiency of this method Heidelberger [24] considered
the following four examples: the queue length process in a finite capacity
M/M/1 queue, the queue length process in the repair problem with spares,
and the waiting time processes in both M/M/1 and M/M/2 queues.
These processes were chosen because analytic results are readily available,
thereby making a comparison between analytic and simulation results
possible. Despite their simplicity, these processes are by no means "easy"
to simulate, in particular the heavily loaded queues, which require very
large run lengths to get good simulation estimates. The simulation results,
which are also presented in Ref. 24, show that for all four examples
substantial variance reduction was obtained. However, as this method
entails additional computations both before and during the course of
simulation, we would recommend using it only when it is computationally
advantageous to do so. In the case of a Markov chain it is likely that the
method will be most effective if the transition matrix of the process is
sparse, in which case the preliminary calculations can be carried out with
relative ease. It is for this type of process that the method is recommended.


Example 2 We consider now another example of variance reduction, 
taken from Ref. 45. Before starting this example we need more mathemati- 
cal background on the regenerative method. 

Let $X$ again be the steady-state vector of the regenerative process
$\{X(t) : t \ge 0\}$, let $f$ and $g$ be given real-valued measurable
functions, and suppose we want to estimate

$$\mu = \frac{E\{f(X)\}}{E\{g(X)\}}. \tag{6.7.29}$$


It follows from Proposition 6.2.2 that, if $E\{|f(X)|\} < \infty$ and
$E\{|g(X)|\} < \infty$, then

$$\mu = \frac{E(Y_i)}{E(Z_i)} = \frac{E(Y)}{E(Z)}, \quad i = 1, 2, \ldots, \tag{6.7.30}$$

where

$$Y_i = \int_{T_i}^{T_{i+1}} f(X(t))\, dt$$

and

$$Z_i = \int_{T_i}^{T_{i+1}} g(X(t))\, dt$$

are dependent random variables defined with respect to a single cycle
$\tau_i = T_{i+1} - T_i$. In the particular case where $g = 1$, we have
$Z_i = \tau_i$ and (6.7.30) becomes (6.2.4). The classical point estimator
for $\mu$ obtained from $M$ cycles is

$$\hat{\mu} = \frac{\sum_{i=1}^{M} Y_i}{\sum_{i=1}^{M} Z_i}, \tag{6.7.31}$$


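and for sufficiently large $M$

$$\frac{\sqrt{M}\,(\hat{\mu} - \mu)}{\sigma} \approx N(0, 1), \tag{6.7.32}$$

where

$$\sigma^2 = \frac{\operatorname{Var}(Y_i - \mu Z_i)}{[E(Z_i)]^2}, \quad i = 1, 2, \ldots.$$

Furthermore, if we replace $\sigma$ with its estimator $\hat{\sigma}$ such
that

$$\hat{\sigma}^2 = \frac{\dfrac{1}{M - 1} \sum_{i=1}^{M} (Y_i - \hat{\mu} Z_i)^2}{\left( \dfrac{1}{M} \sum_{i=1}^{M} Z_i \right)^2}, \tag{6.7.33}$$

the c.l.t. (6.7.32) will also hold; therefore a confidence interval for
$\mu$ can be obtained.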
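
The point estimator (6.7.31) and the variance estimator (6.7.33) translate directly into code. The Python sketch below computes a 90% confidence interval from per-cycle observations $(Y_i, Z_i)$; the function name and the cycle data are hypothetical and are not taken from Ref. 45.

```python
import math


def regenerative_ratio_ci(ys, zs, z_quantile=1.645):
    """Return (mu_hat, (lower, upper)) from per-cycle sums Y_i and Z_i.

    mu_hat follows (6.7.31) and sigma_hat follows (6.7.33);
    z_quantile = 1.645 gives an approximate 90% confidence interval.
    """
    m = len(ys)
    mu_hat = sum(ys) / sum(zs)
    z_bar = sum(zs) / m
    s2 = sum((y - mu_hat * z) ** 2 for y, z in zip(ys, zs)) / (m - 1)
    sigma_hat = math.sqrt(s2) / z_bar
    half_length = z_quantile * sigma_hat / math.sqrt(m)
    return mu_hat, (mu_hat - half_length, mu_hat + half_length)


# Hypothetical data from four cycles, for illustration only.
cycle_y = [3.0, 0.0, 5.0, 1.0]
cycle_z = [2.0, 1.0, 3.0, 2.0]
mu_hat, (lower, upper) = regenerative_ratio_ci(cycle_y, cycle_z)
print(f"mu_hat = {mu_hat:.4f}, 90% interval = ({lower:.4f}, {upper:.4f})")
```

In practice $M$ must be large for the normal approximation (6.7.32) to be trustworthy; four cycles are used here only to keep the arithmetic easy to follow.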

Assume now that we have $Q$ pairs of dependent random variables
$\{Y^{(q)}, Z^{(q)}\}$, $q = 1, \ldots, Q$, defined with respect to a single
cycle. Denote

$$\mu_q = \frac{E(Y^{(q)})}{E(Z^{(q)})}. \tag{6.7.34}$$

Assume also that $\mu_q$, $q = 1, \ldots, Q$, is known, but that the expected
values of the pairs $\{Y^{(q)}, Z^{(q)}\}$ are unknown. In order to apply
control variates in this case the sequence of i.i.d. pairs of random vectors

$$(Y_i, Z_i),\ (Y_i^{(1)}, Z_i^{(1)}),\ \ldots,\ (Y_i^{(Q)}, Z_i^{(Q)}), \quad i = 1, \ldots, M, \tag{6.7.35}$$

is collected, and then the $Q$-dimensional vector of control variates
$C = (C_1, \ldots, C_Q)$ is defined as

$$C_q = \frac{\sum_{i=1}^{M} Y_i^{(q)}}{\sum_{i=1}^{M} Z_i^{(q)}}, \quad q = 1, \ldots, Q. \tag{6.7.36}$$


Now, by analogy with (6.7.4), for any vector $\beta$ a point estimator for
$\mu$ using these control variates is

$$\hat{\mu}(\beta) = \hat{\mu} - \beta^t (C - \mu_c), \tag{6.7.37}$$

where $\mu_c = (\mu_1, \ldots, \mu_Q)$. Note that because $\hat{\mu}$ and
$C_q$, $q = 1, \ldots, Q$, are biased estimators, respectively, for $\mu$ and
$\mu_q$, $q = 1, \ldots, Q$, the estimator $\hat{\mu}(\beta)$ is also biased,
which differentiates it from the unbiased estimator $Z_n(\beta)$ in (6.7.4).
However, $\hat{\mu}(\beta)$ is a strongly consistent estimator of $\mu$ and,
for $M$ sufficiently large,

$$\frac{\sqrt{M}\,(\hat{\mu}(\beta) - \mu)}{\sigma(\beta)} \approx N(0, 1), \tag{6.7.38}$$

where

$$\sigma^2(\beta) = \operatorname{Var}\left[ \frac{Y - \mu Z}{E[Z]} - \sum_{q=1}^{Q} \beta_q \frac{Y^{(q)} - \mu_q Z^{(q)}}{E[Z^{(q)}]} \right]. \tag{6.7.39}$$

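The value of $\beta$ that minimizes $\sigma^2(\beta)$ is (see (4.3.30))

$$\beta^* = \Sigma^{-1} \boldsymbol{\sigma}, \tag{6.7.40}$$

where the matrix $\Sigma$ and the vector $\boldsymbol{\sigma}$ have elements

$$\sigma_{qp} = \operatorname{cov}\left[ \frac{Y^{(q)} - \mu_q Z^{(q)}}{E[Z^{(q)}]},\ \frac{Y^{(p)} - \mu_p Z^{(p)}}{E[Z^{(p)}]} \right] \tag{6.7.41}$$

and

$$\sigma_q = \operatorname{cov}\left[ \frac{Y - \mu Z}{E[Z]},\ \frac{Y^{(q)} - \mu_q Z^{(q)}}{E[Z^{(q)}]} \right]. \tag{6.7.42}$$

The resulting minimum value of $\sigma^2(\beta)$ is

$$\sigma^2(\beta^*) = (1 - R^2)\,\sigma^2, \tag{6.7.43}$$

where

$$R^2 = \frac{\boldsymbol{\sigma}^t \Sigma^{-1} \boldsymbol{\sigma}}{\operatorname{Var}\left[ (Y - \mu Z) / E[Z] \right]}. \tag{6.7.44}$$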


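Finally, for $M$ sufficiently large

$$\frac{\sqrt{M}\,(\hat{\mu}(\hat{\beta}^*) - \mu)}{\hat{\sigma}(\hat{\beta}^*)} \approx N(0, 1), \tag{6.7.45}$$

where $\hat{\beta}^*$ is an estimator of $\beta^*$ and
$\hat{\sigma}^2(\hat{\beta}^*)$ is an estimator of $\sigma^2(\beta^*)$. As $M$
increases $\hat{\sigma}^2(\hat{\beta}^*)/\hat{\sigma}^2$ approaches
$1 - R^2$, and therefore variance reduction can be achieved.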

Now we start with the example given in Ref. 45. Consider a GI/G/1
queue with i.i.d. interarrival times $A_i$ and i.i.d. service times $S_i$.
Let $\mu_1$ be the mean service time and $\mu_2$ be the mean interarrival
time. Assume that the traffic intensity $\rho = \mu_1/\mu_2 < 1$; this means
that the queueing time process $\{W_i, i \ge 0\}$, which is defined by
$W_i = (W_{i-1} + S_{i-1} - A_i)^+$, $i \ge 1$, is a regenerative process with
regenerative points $\{T_k, k = 1, 2, \ldots\}$, where $T_k$ is the serial
number of the $k$th customer that arrives to find the system empty and
$T_1 = 1$ (consult Section 6.4.1).

The steady-state waiting time $E(W) = \mu$ can be estimated by

$$\hat{\mu} = \frac{\sum_{i=1}^{M} Y_i}{\sum_{i=1}^{M} Z_i},$$

where

$$Y_i = \sum_{j=T_i}^{T_{i+1}-1} W_j$$

and $Z_i = T_{i+1} - T_i$. Define

$$Y_i^{(2)} = \sum_{j=T_i}^{T_{i+1}-1} A_j,$$

the duration of the $i$th busy cycle (busy period plus idle time), and

$$Y_i^{(1)} = \sum_{j=T_i}^{T_{i+1}-1} S_j,$$

the duration of the $i$th busy period. It is known [45] that
$E(Y^{(1)}) = \mu_1 E(Z)$ and $E(Y^{(2)}) = \mu_2 E(Z)$, where
$E(Z) = E(Z_i)$, $i = 1, 2, \ldots$.

The following vector of control variates $C = (C_1, C_2)$ with components
(see also (6.7.36))

$$C_q = \frac{\sum_{i=1}^{M} Y_i^{(q)}}{\sum_{i=1}^{M} Z_i}, \quad q = 1, 2,$$

is considered in Ref. 45 and the point estimator $\hat{\mu}(\beta)$ given in
(6.7.37) is adapted for the parameter $\mu = E(W)$. It is shown numerically
in Ref. 45 that substantial variance reduction is obtained by simulating the
GI/G/1 queue and some other queueing models, while using these control
variates.


6.7.2 Common Random Numbers in Comparing Stochastic Systems [23]

In this section we show how the method of common random numbers 
may be used in simulation of discrete and continuous Markov chains for 
variance reductions. 

Suppose we have two irreducible, aperiodic, positive recurrent Markov
chains in discrete time and we wish to construct a confidence interval for
$r^1 - r^2 = E\{f_1(X^1)\} - E\{f_2(X^2)\}$ by simulating the two processes.
Here $X^i$, $i = 1, 2$, is the steady-state r.v. of the regenerative process
$X^i = \{X_t^i : t \ge 0\}$ and the $f_i$ are given real-valued functions
defined on the state space $I_i$ of process $X^i$.

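Let us consider the following two point estimates of $r^i$:

$$\hat{r}_n^{\,i} = \frac{\sum_{k=1}^{n} Y_k^i}{\sum_{k=1}^{n} \tau_k^i} \tag{6.7.46}$$

and

$$\hat{r}_N^{\,i} = \frac{1}{N} \sum_{k=0}^{N-1} f_i(X_k^i), \tag{6.7.47}$$

where $n$ is the number of simulated cycles, $N$ is the number of steps, and
it is assumed without loss of generality that $T_0^i = 0$, $X_0^i = 0$,
$i = 1, 2$. The two c.l.t.'s are the following:

$$\frac{n^{1/2}\,[\hat{r}_n^{\,i} - r^i]}{\sigma_i / E_0\{\tau_1^i\}} \Rightarrow N(0, 1) \tag{6.7.48}$$

$$\frac{N^{1/2}\,[\hat{r}_N^{\,i} - r^i]}{\sigma_i / (E_0\{\tau_1^i\})^{1/2}} \Rightarrow N(0, 1) \tag{6.7.49}$$

as $n$ and $N \to \infty$.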

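To construct a confidence interval for $r^1 - r^2$ we can simulate the two
processes $X^1$ and $X^2$ independently and apply the bivariate c.l.t.

$$N^{1/2}\,[\hat{r}_N - r] \Rightarrow N(0, A), \tag{6.7.50}$$

where $\hat{r}_N = (\hat{r}_N^{\,1}, \hat{r}_N^{\,2})$, $r = (r^1, r^2)$, and
$N(0, A)$ is a two-dimensional normal vector with mean vector $0 = (0, 0)$
and covariance matrix

$$A = \begin{pmatrix} \sigma_1^2 / E_0\{\tau_1^1\} & 0 \\ 0 & \sigma_2^2 / E_0\{\tau_1^2\} \end{pmatrix}.$$

It can be readily shown (see [23]) that

$$\frac{N^{1/2}\,[(\hat{r}_N^{\,1} - \hat{r}_N^{\,2}) - (r^1 - r^2)]}{\sigma} \Rightarrow N(0, 1), \tag{6.7.51}$$

where

$$\sigma^2 = \frac{\sigma_1^2}{E_0\{\tau_1^1\}} + \frac{\sigma_2^2}{E_0\{\tau_1^2\}}.$$

A c.l.t. similar to (6.7.51), but based on simulating $n$ cycles, can also be
obtained to construct a confidence interval for $r^1 - r^2$.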

Now we turn our attention to the problem of using common random
numbers while generating sample paths for $X^1$ and $X^2$. Our goal in using
common random numbers is to produce a shorter confidence interval for
$r^1 - r^2$ for the same length of simulation run. In other words, we seek a
c.l.t. similar to (6.7.51) but with a smaller value of $\sigma$. To
accomplish this we generate the bivariate M.C. $X = \{X_n : n \ge 0\}$, where
$X_n = (X_n^1, X_n^2)$. At each jump of the process $X$ the same random
number is used to generate the jumps of the two marginal chains $X^1$ and
$X^2$. The marginals of the process $X$ are seen to have the same
distributions as the original chains $X^1$ and $X^2$; however, the marginal
chains are now dependent. The state space of the chain $X$ is denoted by $F$,
which is a (possibly proper) subset of $I_1 \times I_2$. We assume here that
the chain $X$ is also irreducible, aperiodic, and positive recurrent. (These
conditions are not automatic but usually hold for practical simulations.)
Furthermore, we assume for convenience that $(0, 0) \in F$ and use that state
to form regenerative cycles. Note that $X_n \Rightarrow X$ as
$n \to \infty$, and the marginal distributions of $X$ are the same as those
of $X^1$ and $X^2$, namely, $\{\pi_j(i) : j \in I_i\}$ for $i = 1, 2$. For any
real-valued function $f : F \to R$ satisfying $E\{|f(X)|\} < \infty$, the
regenerative method can be applied to $X$ to estimate $E\{f(X)\}$. Let
$X_0 = (0, 0)$, $T_0 = 0$, and define the $m$th entrance to state $(0, 0)$ by
$X$ to be

$$T_{m+1} = \inf\{n > T_m : X_n = (0, 0)\}, \quad m \ge 0.$$

Also, let $\tau_m = T_{m+1} - T_m$, $m \ge 0$, be the length of the $m$th
cycle and

$$Y_m(i) = \sum_{n=T_m}^{T_{m+1}-1} f_i(X_n^i), \quad m \ge 0.$$

Set $Z'_m(i) = Y_m(i) - r^i \tau_m$. Since the ratio formula (6.2.4) still
holds for the process $X$, $E_{(0,0)}\{Z'_m(i)\} = 0$ for $i = 1, 2$. Let

$$\sigma_{ij} = E_{(0,0)}\{Z'_1(i) Z'_1(j)\}, \quad i, j = 1, 2,$$

which we assume is finite and nonzero. Since the vectors
$Z'_m = (Z'_m(1), Z'_m(2))$ are i.i.d., the standard c.l.t. yields

$$n^{-1/2} \sum_{m=1}^{n} Z'_m \Rightarrow N(0, \Sigma), \tag{6.7.52}$$

where $\Sigma = \{\sigma_{ij}\}$. By analogy with (6.7.49) and (6.7.50) it can
be shown (see [23]) that

$$N^{1/2}\,[\hat{r}_N - r] \Rightarrow N(0, B), \tag{6.7.53}$$

and

$$\frac{N^{1/2}\,[(\hat{r}_N^{\,1} - \hat{r}_N^{\,2}) - (r^1 - r^2)]}{v} \Rightarrow N(0, 1). \tag{6.7.54}$$

Here $B = \{\sigma_{ij}\} / E_{(0,0)}\{\tau_1\}$ and

$$v^2 = \frac{\sigma_{11} + \sigma_{22} - 2\sigma_{12}}{E_{(0,0)}\{\tau_1\}}.$$

A c.l.t. similar to (6.7.54), but in terms of $n$ regenerative cycles, can
also be obtained. Now consider the marginals of (6.7.53) in conjunction with
(6.7.49). Since the marginals of the chain $X$ have the same stochastic
structure as the chains $X^1$ and $X^2$ considered separately, these two
c.l.t.'s must be identical. Hence

$$\frac{\sigma_i^2}{E_0\{\tau_1^i\}} = \frac{\sigma_{ii}}{E_{(0,0)}\{\tau_1\}}. \tag{6.7.55}$$

Thus upon comparing the constant $\sigma^2$ in (6.7.51) and $v^2$ in
(6.7.54), we conclude that $v^2 \le \sigma^2$ if and only if
$\sigma_{12} \ge 0$.

The measure of variance reduction we use is

$$R^2 = \frac{\sigma^2}{v^2}. \tag{6.7.56}$$

So, for example, if $R^2 = 2$, then only half as many steps of the Markov
chain $X$ need be simulated to obtain a confidence interval of specified
length for $r^1 - r^2$ as would be required when simulating $X^1$ and $X^2$
independently. In addition, of course, only one stream of random numbers
need be generated. While we have worked here with discrete-time Markov 
chains, the same method can be used for continuous-time Markov chains, 
semi-Markov processes, and discrete-time Markov processes with a general 
state space. 

The following definition and properties will be used in obtaining non- 
negative correlation. 

Definition 1 Random variables $Y = (Y_1, \ldots, Y_n)$ are said to be
associated if $\operatorname{cov}\{f(Y), g(Y)\} \ge 0$ for all nondecreasing
functions $f$ and $g$ for which $E\{f(Y)\}$, $E\{g(Y)\}$, and
$E\{f(Y) g(Y)\}$ exist.

Property 1. Any subset of associated random variables is associated.

Property 2. If two sets of associated random variables are independent
of one another, then their union is a set of associated random variables.

Property 3. The set consisting of a single random variable is associated.

Property 4. Nondecreasing functions of associated random variables are
associated.

A class of processes for which nonnegative correlation can be guaranteed
is stochastically monotone Markov chains (s.m.m.c.). In the following
definition let $i$ be a fixed index.

Definition 2 Let $X^i = \{X_n^i, n \ge 0\}$ be a real-valued Markov process
with initial distribution $P_i(x) = P\{X_0^i \le x\}$ and transition function
$P_i(x, A) = P\{X_{n+1}^i \in A \mid X_n^i = x\}$ (for measurable sets $A$).
$X^i$ is said to be an s.m.m.c. if, for every $y$, $P_i(x, (-\infty, y])$ is
a nonincreasing function of $x$.

Define the inverse distribution functions $P_i^{-1}(\cdot)$ and
$P_i^{-1}(x, \cdot)$ by

$$P_i^{-1}(u) = \inf\{y : P_i(y) > u\} \tag{6.7.57}$$

$$P_i^{-1}(x, u) = \inf\{y : P_i(x, (-\infty, y]) > u\}. \tag{6.7.58}$$

Henceforth we assume that the sample paths of $X^i$ are generated on the
computer, using the inverse transformation scheme

$$X_0^i = P_i^{-1}(U_0) \tag{6.7.59}$$

$$X_n^i = P_i^{-1}(X_{n-1}^i, U_n), \quad n \ge 1, \tag{6.7.60}$$

where $\{U_n, n \ge 0\}$ is a sequence of random numbers.
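
To make the scheme (6.7.59)-(6.7.60) concrete, the Python sketch below drives two reflected random walks on $\{0, 1, 2, \ldots\}$ with the same random numbers. The chains, their up-step probabilities, and the function names are assumptions chosen for illustration; both chains start in state 0, so only the transition step (6.7.60) is shown. Each chain moves down when $U_n$ is small and up otherwise, so the update is nondecreasing in both the current state and $U_n$, as required of an s.m.m.c.

```python
import random


def step(x, u, p_up):
    """Inverse-transform step of a reflected random walk on {0, 1, 2, ...}.

    Small u maps to the lower state, so the map is nondecreasing in u
    and in x (the chain is stochastically monotone).
    """
    return max(x - 1, 0) if u < 1.0 - p_up else x + 1


def simulate_pair(n_steps, p_up_1, p_up_2, seed, common=True):
    """Simulate both chains; with common=True they share every U_n."""
    rng_a = random.Random(seed)
    rng_b = random.Random(seed + 1)     # used only for independent sampling
    x1 = x2 = 0
    path1, path2 = [x1], [x2]
    for _ in range(n_steps):
        u1 = rng_a.random()
        u2 = u1 if common else rng_b.random()
        x1 = step(x1, u1, p_up_1)
        x2 = step(x2, u2, p_up_2)
        path1.append(x1)
        path2.append(x2)
    return path1, path2


def time_average(path):
    return sum(path) / len(path)


path1, path2 = simulate_pair(5_000, 0.30, 0.40, seed=7, common=True)
print("estimate of r^2 - r^1:", time_average(path2) - time_average(path1))
```

Because the second chain has the larger up-step probability and the same $U_n$ is used for both, its path never falls below that of the first chain. The two marginals are unchanged by the coupling, but the paths move together, which is the source of the positive covariance $\sigma_{12}$.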


Notice that, if $X^i$ is an s.m.m.c., then $P_i^{-1}(x, u)$ is a
nondecreasing function in both arguments. This fact enables us to show that
for each $n \ge 0$, $\{X_0^1, \ldots, X_n^1, X_0^2, \ldots, X_n^2\}$ are
associated.

Theorem 6.7.1 If $X^1$ and $X^2$ are both s.m.m.c.'s with sample paths
generated by (6.7.59) and (6.7.60), then, for each $n \ge 0$,
$\{X_0^1, \ldots, X_n^1, X_0^2, \ldots, X_n^2\}$ are associated random
variables.

Proof The proof is by induction. For $n = 0$ Property 3 implies that
$\{U_0\}$ is associated, and since $P_i^{-1}(U_0)$ is a nondecreasing function
of $U_0$ for each $i$, Property 4 yields that $\{X_0^1, X_0^2\}$ are
associated. Assume now that $\{X_0^1, \ldots, X_n^1, X_0^2, \ldots, X_n^2\}$
are associated. Since $U_{n+1}$ is independent of this set,
$\{X_0^1, \ldots, X_n^1, X_0^2, \ldots, X_n^2, U_{n+1}\}$ are associated by
Properties 2 and 3. The map that takes these random variables into
$\{X_0^1, \ldots, X_n^1, X_{n+1}^1, X_0^2, \ldots, X_n^2, X_{n+1}^2\}$ is
nondecreasing because $X^1$ and $X^2$ are both s.m.m.c.'s. Property 4 then
yields the final result. Q.E.D.

The following theorem, whose proof is found in Ref. 23, shows that, 
when simulating s.m.m.c.’s using common random numbers, a reduction in 
variance is achieved. 

Theorem 6.7.2 Let $X^1$ and $X^2$ both be s.m.m.c.'s with sample paths
generated by (6.7.59) and (6.7.60). Let $f_1$ and $f_2$ be nondecreasing
functions. If

$$E_{(0,0)}\{\tau_1^2\} < \infty \tag{6.7.61}$$

and

$$E_{(0,0)}\left\{ \left[ \sum_{n=0}^{\tau_1 - 1} |f_i(X_n^i)| \right]^2 \right\} < \infty, \quad \text{for } i = 1, 2, \tag{6.7.62}$$

then $\sigma_{12} \ge 0$.


The efficiency of common random numbers in variance reduction was 
checked for different output parameters of regenerative processes and 
substantial variance reduction was achieved only for some particular cases. 
The effect of variance reduction decreases with increasing complexity of 
the processes being simulated. The method is effective only where the 
expected cycle length is sufficiently short. If preliminary simulation runs 
indicate that the expected cycle length is excessive, it is suggested that 
independent simulations be performed. 


EXERCISES 

1 For the data given in Fig. 6.4.1 construct a 90% confidence interval using
the classical estimator

$$\hat{r} = \frac{\sum_{i=1}^{n} Y_i}{\sum_{i=1}^{n} \tau_i},$$

where $n$ is the number of cycles.

2 Prove by induction that, if $f_\nu = P^\nu f$, where $P^\nu$ is the
$\nu$-step transition matrix, then $r_\nu = \pi f_\nu$ is equal to
$r = \pi f$. Here $\pi$ is the steady-state distribution of $P$. From
Heidelberger [24].

3 Prove that, if $\pi|f| < \infty$, then $\pi f = \pi(Pf)$. From Heidelberger
[24].

4 Prove that the solution of the problem (6.7.20)-(6.7.21) is
(6.7.22)-(6.7.23).

5 Consider the following system of linear equations; 

r-apy+z, 

where $P$ is the $(n \times n)$ transition matrix of an ergodic Markov chain
with stationary distribution $\pi = \pi P$, $a < \infty$. Prove that


where $r = \langle \pi, f \rangle$.

6 Inventory Model. Consider a situation in which a commodity is stocked in
order to satisfy some demand. An inventory $(s, S)$ policy is characterized
by two positive numbers $s$ and $S$ with $S > s$. If the available stock
quantity is greater than $s$, do not order. If the amount of inventory on
hand plus on order is less than $s$, order to bring the quantity of stock to
$S$. Let $X_j$ denote the level of inventory on hand plus on order in period
$j$ after ordering. Let $d_j$ denote the demand in period $j$; then the stock
values

$$X_j = \begin{cases} X_{j-1} - d_j, & \text{if } d_j \le X_{j-1} - s, \\ S, & \text{otherwise} \end{cases}$$

define a Markov chain with state space $I = \{s, s + 1, \ldots, S\}$, where it is

assumed that s < X 0 < S. As a numerical example let s « 2, S = 5, and 
{p(dj = 0)=$,p(d J = l) = l,P(d J = 2)=i,and Then ihe transi- 

tion matrix is 


P 


1 0 0 

1 1 0 

B B U 

2 l 2 

BBS 

3 2 I 

SIS 


6 

8 

5 

B 

3 

1 

2 

B 


(a) Find the stationary probabilities $\pi_i$, $i \in I$, analytically and by
simulating the Markov chain, making a run of 1000 cycles.

(b) Describe a program to simulate the regenerative process
$\{X(n) : n \ge 0\}$, including a flow diagram, a listing of the program, and
the random number generator.


7 M/M/1 Queue. Run this queueing model for 2000 cycles. From the simulated
data:


(a) Fill out a table similar to Table 6.4.1, taking the same parameters, that
is, $\lambda = 5$, $\mu = 10$, and the 90% confidence interval.

(b) Describe your random number generator, a flow diagram, and a listing of 
your program. 

8 Repairman model with spares. Select the same parameters as in Section 6.4.2,
that is, assume $n = 10$, $m = 5$, $s = 4$, $\mu = 2$, $\lambda = 5$, and
choose the 95% confidence level. Run the model for 500 cycles and, from the
simulated data:

(a) Fill out a table similar to Table 6.4.2.

(b) Describe your random number generator, a flow diagram of your program,
and a listing of your program.


CHAPTER 7

Monte Carlo Optimization


Optimization is the science of selecting the best of many possible decisions
in a complex real-life environment. The subject of this chapter is Monte
Carlo optimization, a subject playing an important role in finding
extrema (that is, minima or maxima) of complicated nonconvex real-valued
functions. We show how Monte Carlo methods can be successfully applied
to complex optimization problems where the convex optimization methods
(see Avriel [2]) fail. Before proceeding to the rest of the chapter,
however, we explain what we mean by local and global extrema for
unconstrained optimization.

Consider a real-valued function g with domain D in R n . The function g 
is said to have a local maximum at point x* G D if there exists a real 
number 6 > 0 such that g(x) < g(x*) for all x E D satisfying ||jr - x*\\ < 8 . 
We define a local minimum in a similar way, but in the sense that 
inequality g(x) < g(x*) is reversed. If the inequality g(x) < g(x*) is re- 
placed by a strict inequality 

g(x) < £(**)> x e D y x # x*, 

we have a strict local maximum; and if the sense of the inequality 
g(x) < g(x *) is reversed, we have a strict local minimum. We say that the 
function g has a global (absolute) maximum (strict global maximum) at 
x*E:D if g(x) < g(x*) f |g(*) < g(x*)] holds for every x G D. A similar 
definition holds for a global minimum (strict global minimum). A global 
maximum at x * implies that g(x) takes on its greatest value g(x *) at that 
point no matter where else we may search in the set D . A local maximum, 
on the other hand, only guarantees that the value of g(x) is a maximum 
with respect to other points nearby, specifically in a 6-region about x*. 
Thus a function may have many local maxima, each with a different value 
of $g(x)$, say, $g(x_j^0)$, $j = 1, \ldots, k$. The global maximum can always be 
chosen from among these local maxima by comparing their values and 
choosing one such that 

$$g(x^*) \ge g(x_j^0), \quad j = 1, \ldots, k,$$

where 

$$x^* \in \{x_j^0,\ j = 1, \ldots, k\}.$$

It is clear that every global maximum (minimum) is also a local maximum 
(minimum); however, the converse of this statement is, in general, not true. 
If $g(x)$ is a convex function in $R^n$ and $D \subset R^n$ is a convex set, then every local 
minimum of $g$ at $x \in D$ is also a global minimum of $g$ over $D$ [2]. 


7.1 RANDOM SEARCH ALGORITHMS 

Consider the following deterministic optimization problem: 

$$\max_{x \in D} g(x) = g(x^*) = g^*, \qquad (7.1.1)$$

where $g(x)$ is a real-valued bounded function defined on a closed bounded 
domain $D \subset R^n$. It is assumed that $g$ achieves its maximum value at a 
unique point $x^*$. The function $g(x)$ may have many local maxima in $D$ but 
only one global maximum. 

When $g(x)$ and $D$ have some attractive properties, for instance, $g(x)$ is a 
differentiable concave function and $D$ is a convex region, then, as previously 
mentioned, a local maximum is also a global maximum and problem 
(7.1.1) can be solved explicitly by mathematical programming methods (see 
Avriel [2]). If the problem cannot be solved explicitly, then numerical 
methods, in particular Monte Carlo methods, can be applied. For better 
understanding of the subsequent text we describe an iterative gradient 
algorithm, assuming for simplicity that the set $D = R^n$. 

According to the gradient algorithm, we approximate the point $x^*$ step 
by step. If on the $i$th iteration ($i = 1, 2, \ldots$) we have reached point $x_i$, then 
the next point $x_{i+1}$ is chosen as 

$$x_{i+1} = x_i + \alpha_i \nabla g(x_i), \quad \alpha_i > 0, \qquad (7.1.2)$$

where 

$$\nabla g(x) = \left( \frac{\partial g(x)}{\partial x_1}, \ldots, \frac{\partial g(x)}{\partial x_n} \right)$$

is the gradient of $g(x)$, where $\partial g(x)/\partial x_k$, $k = 1, \ldots, n$, are the partial 
derivatives, and where $\alpha_i > 0$ is the step parameter. 

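If the function $g(x)$ is not differentiable or if the analytic expression of 
$g(x)$ is not given explicitly (only the values of $g(x)$ can be observed at each 
point $x \in D$), then the finite difference gradient algorithm 

$$x_{i+1} = x_i + \alpha_i \hat{\nabla} g(x_i) \qquad (7.1.3)$$

can be applied. In (7.1.3) 

$$\hat{\nabla} g(x) = \left( \frac{\hat{\partial} g(x)}{\partial x_1}, \ldots, \frac{\hat{\partial} g(x)}{\partial x_n} \right) = \left( \frac{g(x_1 + \beta_1, x_2, \ldots, x_n) - g(x_1 - \beta_1, x_2, \ldots, x_n)}{2\beta_1}, \ldots, \frac{g(x_1, \ldots, x_n + \beta_n) - g(x_1, \ldots, x_n - \beta_n)}{2\beta_n} \right)$$

is the finite difference estimate of the gradient $\nabla g(x)$. 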
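
The finite difference gradient algorithm (7.1.3) can be sketched as follows. This is a minimal illustration, not the book's program: it assumes a constant step parameter `alpha` and a single constant trial step `beta` for all coordinates (the text allows these to vary), and the names `fd_gradient`, `fd_gradient_ascent`, and `concave_quadratic` are hypothetical.

```python
def fd_gradient(g, x, beta):
    """Central-difference estimate of the gradient of g at x."""
    grad = []
    for k in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[k] += beta
        x_minus[k] -= beta
        grad.append((g(x_plus) - g(x_minus)) / (2.0 * beta))
    return grad


def fd_gradient_ascent(g, x0, alpha=0.1, beta=1e-4, iterations=200):
    """Iterate x_{i+1} = x_i + alpha * fd_gradient(g, x_i, beta)."""
    x = list(x0)
    for _ in range(iterations):
        grad = fd_gradient(g, x, beta)
        x = [xk + alpha * gk for xk, gk in zip(x, grad)]
    return x


def concave_quadratic(x):
    """Hypothetical test function with a unique maximum at (1, -2)."""
    return -(x[0] - 1.0) ** 2 - (x[1] + 2.0) ** 2


x_max = fd_gradient_ascent(concave_quadratic, [0.0, 0.0])
```

Only values of `g` are used, never its analytic derivatives, which is the situation the text describes. Each iteration costs $2n$ function evaluations.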

Under some rather mild conditions (see Avriel [2]) on $g(x)$ and $\alpha_i$, the 
algorithm (7.1.3) converges to the local extremum $x^*$. In the case where 
either $g(x)$ or the region $D$ is nonconvex, the classical numerical 
optimization methods can fail to find the global extremum. However, Monte 
Carlo methods, in particular random search algorithms, can be applied. 

If we assume, for instance, that $g(x)$ is a multiextremal function, then 
procedures (7.1.2) and (7.1.3) converge only to one of the local extrema, 
depending on the choice of the initial point $x_0$ from which the algorithms 
(7.1.2) and (7.1.3) start. 

We consider several random search algorithms capable of finding the 
extremum x* for complex nonconvex functions. 

The random search algorithms have been described in many papers and 
books (see Ermolyev [9], Katkovnik [17], Rastrigin [28], and Rubinstein 
[31-36]), and successfully implemented for various complex optimization 
problems. We now consider several random search algorithms. 


Random Search Double Trial Algorithm (Algorithm RS-1) 

$$x_{i+1} = x_i + \frac{\alpha_i}{2\beta_i} \left[ g(x_i + \beta_i Z_i) - g(x_i - \beta_i Z_i) \right] Z_i, \quad \alpha_i > 0, \ \beta_i > 0. \qquad (7.1.4)$$

According to this algorithm, at the $i$th iteration we generate a random 
vector $Z_i$ continuously distributed on the $n$-dimensional unit sphere, 
calculate the increment (see Fig. 7.1.1) 

$$\Delta g_i = g(x_i + \beta_i Z_i) - g(x_i - \beta_i Z_i), \qquad (7.1.5)$$
Fig. 7.1.1 Graphical representation of the double trial random search 
algorithm RS-1. 


and choose the next point according to (7.1.4). It is not difficult to see that 
this algorithm generalizes the gradient algorithm (7.1.3). Only in the 
particular case where $Z_i$ is taken in the direction of the gradient do 
procedures (7.1.3) and (7.1.4) coincide. 
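
A minimal sketch of the double trial algorithm RS-1 is given below. It assumes the step takes the form $\frac{\alpha}{2\beta}[g(x + \beta Z) - g(x - \beta Z)]Z$ with constant `alpha` and `beta`, and it draws $Z$ uniformly on the unit sphere by normalizing a vector of independent standard normal variates. The function names and the test function are hypothetical.

```python
import math
import random


def random_unit_vector(n, rng):
    """Direction uniformly distributed on the unit sphere in R^n."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def rs1(g, x0, alpha=0.2, beta=1e-3, iterations=2000, seed=0):
    """Double trial random search: one random direction, two trials per step."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(iterations):
        z = random_unit_vector(len(x), rng)
        g_plus = g([xk + beta * zk for xk, zk in zip(x, z)])
        g_minus = g([xk - beta * zk for xk, zk in zip(x, z)])
        step = alpha * (g_plus - g_minus) / (2.0 * beta)
        x = [xk + step * zk for xk, zk in zip(x, z)]
    return x


def concave_quadratic(x):
    """Hypothetical test function with a unique maximum at (1, -2)."""
    return -(x[0] - 1.0) ** 2 - (x[1] + 2.0) ** 2


x_rs1 = rs1(concave_quadratic, [0.0, 0.0])
```

Each iteration needs only two observations of `g`, whatever the dimension, in contrast with the $2n$ observations of the finite difference gradient.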

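Nonlinear Tactic Random Search Algorithm (Algorithm RS-2) 

$$x_{i+1} = x_i + \frac{\alpha_i}{\beta_i} Y_i \,\mathrm{Sign}\, Y_i \, Z_i, \quad \alpha_i > 0, \ \beta_i > 0, \qquad (7.1.6)$$

where 

$$Y_i = g(x_i + \beta_i Z_i) - g(x_i), \qquad \mathrm{Sign}\, Y_i = \begin{cases} 1, & \text{if } Y_i > 0 \\ 0, & \text{if } Y_i \le 0. \end{cases} \qquad (7.1.7)$$

According to this algorithm, we perform a trial step in the random 
direction $Z_i$ and check the Sign $Y_i$. If $Y_i > 0$, then 
$x_{i+1} = x_i + \alpha_i \beta_i^{-1} Y_i Z_i$. 
If $Y_i \le 0$, then $x_{i+1} = x_i$ and no iteration is made. 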
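
The nonlinear tactic can be sketched as below, assuming the accepted step is $\alpha \beta^{-1} Y Z$ with constant `alpha` and `beta`. With a fixed trial step the search cannot resolve the maximum more finely than a distance of order `beta`, so the example only checks that it ends near the optimum. All names are hypothetical.

```python
import math
import random


def random_unit_vector(n, rng):
    """Direction uniformly distributed on the unit sphere in R^n."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def rs2(g, x0, alpha=0.5, beta=0.05, iterations=3000, seed=0):
    """Nonlinear tactic: move only when the single trial step improves g."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(iterations):
        z = random_unit_vector(len(x), rng)
        y = g([xk + beta * zk for xk, zk in zip(x, z)]) - g(x)
        if y > 0:                      # Sign Y = 1: accept the move
            step = alpha * y / beta
            x = [xk + step * zk for xk, zk in zip(x, z)]
        # otherwise Sign Y = 0 and the point is left unchanged
    return x


def concave_quadratic(x):
    """Hypothetical test function with a unique maximum at (1, -2)."""
    return -(x[0] - 1.0) ** 2 - (x[1] + 2.0) ** 2


x_rs2 = rs2(concave_quadratic, [0.0, 0.0])
```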


Linear Tactic Random Search Algorithm (Algorithm RS-3) 
This algorithm contains the following steps: 

1 $i \leftarrow 0$, generate $Z_0$. 

2 Calculate the increment 

$$Y_i = g(x_i + \beta_i Z_i) - g(x_i).$$

3 If $Y_i \le 0$, go to step 6. 
4 

$$x_{i+1} = x_i + \frac{\alpha_i}{\beta_i} Y_i Z_i, \quad \alpha_i > 0, \ \beta_i > 0. \qquad (7.1.8)$$

5 Go to step 7. 

6 $x_{i+1} \leftarrow x_i$, $i \leftarrow i + 1$; generate $Z_i$. 

7 Go to step 2. 

Thus if $Y_i > 0$, we perform as many iterations as possible in the initial 
chosen random direction $Z_i$; if $Y_i \le 0$, we generate a new random 
vector $Z_i$ and perform only one iteration according to the nonlinear tactic 
random search algorithm RS-2. 

It is not difficult to see that search in the same direction versus choice of 
a new direction is subject to the shape of $g(x)$. The flatter the gradient 
lines, the more iterations will be performed according to step 4 and 
correspondingly the fewer iterations according to step 6. In the particular 
case where $g(x)$ is a linear function, all iterations will be performed 
according to step 4 in the direction of the vector $x_0 + \alpha_0 \beta_0^{-1} Y_0 Z_0$, where 
$Z_0$ is the first random vector such that $Y_0 > 0$, and no iteration will be 
performed according to step 6. This is the reason why this algorithm is 
called a linear tactic random search algorithm. 
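
The linear tactic can be sketched as a loop that keeps the current direction while the trial increment is positive and draws a new one otherwise. This sketch assumes constant `alpha` and `beta`. It also adds one safeguard that is not part of the algorithm as stated: a direction is abandoned once the gain falls below a small tolerance `min_gain`, because with a fixed `beta` the increment along one line shrinks toward zero without ever changing sign. All names are hypothetical.

```python
import math
import random


def random_unit_vector(n, rng):
    """Direction uniformly distributed on the unit sphere in R^n."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def rs3(g, x0, alpha=0.25, beta=0.05, iterations=3000, min_gain=1e-9, seed=0):
    """Linear tactic: reuse a successful direction, redraw after a failure."""
    rng = random.Random(seed)
    x = list(x0)
    z = random_unit_vector(len(x), rng)            # step 1
    for _ in range(iterations):
        y = g([xk + beta * zk for xk, zk in zip(x, z)]) - g(x)   # step 2
        if y > min_gain:                           # step 4: same direction
            step = alpha * y / beta
            x = [xk + step * zk for xk, zk in zip(x, z)]
        else:                                      # step 6: new direction
            z = random_unit_vector(len(x), rng)
    return x


def concave_quadratic(x):
    """Hypothetical test function with a unique maximum at (1, -2)."""
    return -(x[0] - 1.0) ** 2 - (x[1] + 2.0) ** 2


x_rs3 = rs3(concave_quadratic, [0.0, 0.0])
```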

Optimum Trial Random Search Algorithm (Algorithm RS-4) 

This algorithm comprises the following steps: 

1 Choose $N > 1$ independent random points $x_i + \beta_i Z_{ik}$ on the sphere 
$\{x_i + \beta_i Z_i\}$, where $Z_i$ is a random vector continuously distributed on the 
unit sphere with realizations $Z_{ik}$, $k = 1, \ldots, N$. 

2 Consider the sequence of increments 

$$Y_{ik} = g(x_i + \beta_i Z_{ik}) - g(x_i), \quad k = 1, \ldots, N. \qquad (7.1.9)$$

3 Set 

$$Y_i^0 = \max(Y_{i1}, \ldots, Y_{iN}) \qquad (7.1.10)$$

and let $Z_i^0$ denote the direction that has produced this maximum. 

4 The point $x_{i+1}$ is chosen according to the following iterative 
procedure: 

$$x_{i+1} = x_i + \frac{\alpha_i}{\beta_i} Y_i^0 Z_i^0, \quad \alpha_i > 0, \ \beta_i > 0. \qquad (7.1.11)$$

Thus the next point $x_{i+1}$ is chosen in the direction of the greatest 
increase $Y_i^0$ of the function $g(x)$; that is, the vector $Z_i^0$ corresponds to 
the optimal trial among those available. 
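
The four steps of RS-4 can be sketched as follows, assuming the update is $x_{i+1} = x_i + \alpha \beta^{-1} Y^0 Z^0$ with constant `alpha`, `beta`, and number of trials. Because the best increment is used even when it is negative, the iterate fluctuates within a distance of order `beta` of the maximum rather than settling exactly on it. All names are hypothetical.

```python
import math
import random


def random_unit_vector(n, rng):
    """Direction uniformly distributed on the unit sphere in R^n."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def rs4(g, x0, n_trials=4, alpha=0.5, beta=0.05, iterations=1000, seed=0):
    """Optimum trial search: step along the best of n_trials random directions."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(iterations):
        best_y, best_z = None, None
        for _ in range(n_trials):                  # steps 1 and 2
            z = random_unit_vector(len(x), rng)
            y = g([xk + beta * zk for xk, zk in zip(x, z)]) - g(x)
            if best_y is None or y > best_y:       # step 3
                best_y, best_z = y, z
        step = alpha * best_y / beta               # step 4
        x = [xk + step * zk for xk, zk in zip(x, best_z)]
    return x


def concave_quadratic(x):
    """Hypothetical test function with a unique maximum at (1, -2)."""
    return -(x[0] - 1.0) ** 2 - (x[1] + 2.0) ** 2


x_rs4 = rs4(concave_quadratic, [0.0, 0.0])
```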
Statistical Gradient Random Search Algorithm (Algorithm RS-5) 

This algorithm can be described as follows. 

1 Choose $N > 1$ independent random points $x_i + \beta_i Z_{ik}$ on the sphere 
$\{x_i + \beta_i Z_i\}$, where $Z_i$ is a random vector continuously distributed on the 
unit sphere with realizations $Z_{ik}$, $k = 1, \ldots, N$. 

2 Calculate the sequence of increments 

$$Y_{ik} = g(x_i + \beta_i Z_{ik}) - g(x_i), \quad k = 1, \ldots, N. \qquad (7.1.12)$$

3 Set 

$$\bar{Y}_{iZ} = \frac{1}{N} \sum_{k=1}^{N} Y_{ik} Z_{ik}. \qquad (7.1.13)$$

4 The point $x_{i+1}$ is chosen according to 

$$x_{i+1} = x_i + \alpha_i \beta_i^{-1} \bar{Y}_{iZ}, \quad \alpha_i > 0, \ \beta_i > 0. \qquad (7.1.14)$$

Thus given $x_i$, the next point $x_{i+1}$ is chosen in the direction $\bar{Y}_{iZ}$, which is a 
result of averaging the sample $Z_{i1}, \ldots, Z_{iN}$ weighted with their 
corresponding increments $Y_{ik}$ (7.1.12). In the particular case where $N = n$ and 

$$Z_{ik} = e_k = (0, \ldots, 0, 1, 0, \ldots, 0), \quad k = 1, \ldots, n,$$

with the 1 in the $k$th position, we obtain the following finite difference 
gradient algorithm: 

$$x_{i+1} = x_i + \alpha_i \hat{\nabla} g(x_i), \qquad (7.1.15)$$

where 

$$\hat{\nabla} g(x) = \left( \frac{\hat{\partial} g(x)}{\partial x_1}, \ldots, \frac{\hat{\partial} g(x)}{\partial x_n} \right) = \left( \frac{g(x_1 + \beta_1, x_2, \ldots, x_n) - g(x)}{\beta_1}, \ldots, \frac{g(x_1, \ldots, x_n + \beta_n) - g(x)}{\beta_n} \right).$$

It is not difficult to prove that for a linear function the direction of $\bar{Y}_{iZ}$, on 
the average, coincides with that of the gradient of $g(x)$. This is the reason 
why the algorithm is called the "statistical gradient algorithm." 
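
A minimal sketch of the statistical gradient algorithm RS-5 follows, assuming the search direction is the average of the trial directions weighted by their increments and that `alpha`, `beta`, and the number of trials are constant. All names are hypothetical.

```python
import math
import random


def random_unit_vector(n, rng):
    """Direction uniformly distributed on the unit sphere in R^n."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def rs5(g, x0, n_trials=8, alpha=0.5, beta=0.05, iterations=500, seed=0):
    """Statistical gradient search: average of increment-weighted directions."""
    rng = random.Random(seed)
    x = list(x0)
    n = len(x)
    for _ in range(iterations):
        g_x = g(x)
        average = [0.0] * n
        for _ in range(n_trials):
            z = random_unit_vector(n, rng)
            y = g([xk + beta * zk for xk, zk in zip(x, z)]) - g_x
            average = [ak + y * zk / n_trials for ak, zk in zip(average, z)]
        x = [xk + alpha * ak / beta for xk, ak in zip(x, average)]
    return x


def concave_quadratic(x):
    """Hypothetical test function with a unique maximum at (1, -2)."""
    return -(x[0] - 1.0) ** 2 - (x[1] + 2.0) ** 2


x_rs5 = rs5(concave_quadratic, [0.0, 0.0])
```

Each iteration uses `n_trials + 1` observations of `g`: one at the current point and one per trial direction.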

Consider the following stochastic optimization problem: 

$$\max_{x \in D \subset R^n} E[\phi(x, W)] = \max_{x \in D \subset R^n} g(x) = g(x^*) = g^*. \qquad (7.1.16)$$

Here $\phi(x, W)$ is a function of two variables, $x$ and $W$; $x^*$ is the optimal 
point of $g(x)$, which is assumed to be unique, and $W$ is an r.v. with 
unknown p.d.f. $f_W(w)$. We assume that at each point $x \in D$ only the 
individual realization of $\phi(x, W)$ can be observed. 

It is clear that, if the p.d.f. $f_W(w)$ is unknown, problem (7.1.16) cannot 
be solved analytically. However, numerical methods can be applied. 
One widely used numerical method for solving (7.1.16) is the stochastic 
approximation method. This method was originated by Robbins and Monro 
[30], who suggested a procedure for finding a root of a regression function 
measured with noise. Kiefer and Wolfowitz [19] considered a procedure 
for finding $x^*$ in the optimization problem (7.1.16) where $x \in R^1$. The 
procedures of Robbins-Monro and Kiefer-Wolfowitz were generalized by 
Dvoretzky [8]. Hundreds of papers and many books have been written in 
the past 15 years about stochastic approximation methods, their 
convergence, and their applications. The reader is referred to Wilde [44] 
and Wasan [43]. 

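We consider the following algorithm: 

$$x_{i+1} = x_i + \alpha_i \hat{\nabla} \phi(x_i, W_i), \qquad (7.1.17)$$

where 

$$\hat{\nabla} \phi(x, W) = \left( \frac{\hat{\partial} \phi(x, W)}{\partial x_1}, \ldots, \frac{\hat{\partial} \phi(x, W)}{\partial x_n} \right) = \left( \frac{\phi(x_1 + \beta_1, x_2, \ldots, x_n, W_{11}) - \phi(x_1 - \beta_1, x_2, \ldots, x_n, W_{12})}{2\beta_1}, \ldots, \frac{\phi(x_1, x_2, \ldots, x_n + \beta_n, W_{n1}) - \phi(x_1, x_2, \ldots, x_n - \beta_n, W_{n2})}{2\beta_n} \right)$$

is the estimate of the gradient $\nabla g(x)$. 

It is readily seen that in the absence of noise, that is, when $W \equiv 0$, 
$\hat{\nabla} \phi(x, W) = \hat{\nabla} g(x)$ and (7.1.17) coincides with (7.1.3). In addition, if the 
realizations of the noise are independent and $E(W) = 0$, then $\hat{\nabla} \phi(x, W)$ is 
an unbiased estimator of $\hat{\nabla} g(x)$. 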
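
The stochastic approximation iteration (7.1.17) can be illustrated with a small sketch. The assumptions here go beyond the text and are chosen only for illustration: the noise is additive and Gaussian, the step sequences are $\alpha_i = 1/i$ and $\beta_i = i^{-1/4}$, and the same $\beta_i$ is used for every coordinate. The names `stochastic_approximation` and `noisy_quadratic` are hypothetical.

```python
import math
import random


def stochastic_approximation(phi, x0, iterations=5000, seed=0):
    """Gradient ascent driven by central differences of noisy observations."""
    rng = random.Random(seed)
    x = list(x0)
    for i in range(1, iterations + 1):
        alpha = 1.0 / i               # assumed step sequence
        beta = 1.0 / i ** 0.25        # assumed trial-step sequence
        grad = []
        for k in range(len(x)):
            x_plus, x_minus = list(x), list(x)
            x_plus[k] += beta
            x_minus[k] -= beta
            # Two independent noisy observations per coordinate.
            grad.append((phi(x_plus, rng) - phi(x_minus, rng)) / (2.0 * beta))
        x = [xk + alpha * gk for xk, gk in zip(x, grad)]
    return x


def noisy_quadratic(x, rng):
    """Hypothetical observation phi(x, W) = g(x) + W with maximum at (1, -2)."""
    g = -(x[0] - 1.0) ** 2 - (x[1] + 2.0) ** 2
    return g + rng.gauss(0.0, 0.1)


x_sa = stochastic_approximation(noisy_quadratic, [0.0, 0.0])
```

The decreasing `alpha` averages out the noise, while the slowly decreasing `beta` keeps the variance of the difference quotient under control.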

Proof of convergence of algorithm (7.1.17) to $x^*$, subject to some 
conditions on the sequences $\{\alpha_i\}_{i=1}^{\infty}$, $\{\beta_i\}_{i=1}^{\infty}$, and the function $\phi(x, W)$, can 
be found, for instance, in Dvoretzky [8], Gladyshev [13], and Wasan [43]. 

It is not difficult to understand that the random search algorithm can 
also be used for solving problem (7.1.16). For instance, by analogy with 
(7.1.17) the random search double trial algorithm (Algorithm RS-1) can be 
written as 

$$x_{i+1} = x_i + \frac{\alpha_i}{2\beta_i} \left[ \phi(x_i + \beta_i Z_i, W_{i1}) - \phi(x_i - \beta_i Z_i, W_{i2}) \right] Z_i. \qquad (7.1.18)$$

We can see that, for the same reasons as the random search algorithm 
(7.1.4) extends the gradient algorithm (7.1.3), the random search algorithm 
(7.1.18) extends the stochastic approximation algorithm (7.1.17). 

Proof of convergence of (7.1.18) to $x^*$ can be found in Rubinstein [31]. 
In analogy with (7.1.18) we can adapt any of the random search algorithms 
RS-2 through RS-5 for solving problem (7.1.16). 
7.2 EFFICIENCY OF THE RANDOM SEARCH ALGORITHMS 


The random search algorithms can be compared according to different 
criteria. Usually, they are compared according to their local and integral 
properties [28, 29]. 

Local properties are associated with a single iteration of the random 
search algorithm, integral properties with many iterations. When comparing 
different algorithms according to integral properties we usually define: 


1 The initial condition from which search starts. 

2 A set of test functions (linear, quadratic, parabolic, multiextremal, 
etc.) for which the extremum is sought. 

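3 Some criteria that must be achieved during optimization. The following 
criteria can be used. Find an index $k$ corresponding to the best 
algorithm among $S$ algorithms available, such that: 

(a) 

$$\min_{j = 1, \ldots, S} E(\|x_i^{(j)} - x^*\|) = E(\|x_i^{(k)} - x^*\|),$$

where the number of iterations $i$ is given. 

(b) 

$$\max_{j = 1, \ldots, S} \Pr(\|x_i^{(j)} - x^*\| \le \varepsilon_x) = \Pr(\|x_i^{(k)} - x^*\| \le \varepsilon_x),$$

where $i$ and $\varepsilon_x$ are given. 

(c) 

$$\max_{j = 1, \ldots, S} \Pr(|g(x_i^{(j)}) - g(x^*)| \le \varepsilon_g) = \Pr(|g(x_i^{(k)}) - g(x^*)| \le \varepsilon_g),$$

where $i$ and $\varepsilon_g$ are given. 

(d) 

$$\min_{j = 1, \ldots, S} i^{(j)} = i^{(k)},$$

subject to 

$$E(x_i^{(j)}) \in [R_x = \{x : \|x - x^*\| \le \varepsilon_x\}]$$

or 

$$E(x_i^{(j)}) \in [R_g = \{x : |g(x) - g(x^*)| \le \varepsilon_g\}].$$

(e) 

$$\min_{j = 1, \ldots, S} i^{(j)} = i^{(k)},$$

subject to 

$$E(x_i^{(j)}) \in [R_x = \{x : \|x - x^*\| \le \varepsilon_x\}], \qquad \mathrm{var}\, x_i^{(j)} \le c, \quad c > 0.$$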
It is readily seen that the first three problems are associated with finding 
the best algorithm when the number of iterations $i$ is given; the last two 
involve finding the best algorithm that hits, in the minimum number of 
iterations, a given region $R_x$ or $R_g$ containing the extremum point $x^*$. In 
Section 7.3 we consider some local and integral properties of Algorithm 
RS-4. 

Generally, the problem of comparison of different algorithms according 
to their integral properties is difficult to solve. Some attempts to overcome 
this difficulty have been made by Rastrigin [28]. Another interesting 
problem is how to find the optimal combination of algorithms, each of 
which is capable of finding the extremum of $g(x)$. This problem is solved 
in Rubinstein [33] and uses Bellman's principle of optimality. 

Now we consider some local properties of the random search algorithms, 
assuming that some point $x_i$ has been reached, and that we are allowed to 
make only a single step (iteration). Let $x_{i+1}^{(s)}$, $s = 1, \ldots, S$, be the point (the 
state of the system) after this single iteration. Let us define the efficiency of 
the random search algorithms as 

$$C_n^{(s)} = \frac{E(\Delta x_i^{(s)})}{E(N_i^{(s)})}, \qquad (7.2.1)$$

where 

$$\Delta x_i^{(s)} = \frac{\langle x_{i+1}^{(s)} - x_i,\ x^* - x_i \rangle}{\|x^* - x_i\|},$$

that is, where $\Delta x_i^{(s)}$ is the projection of the vector $x_{i+1}^{(s)} - x_i$ on the direction 
of the vector $x^* - x_i$, and $N_i^{(s)}$ is the number of observations 
(measurements) of $g(x)$ required for the algorithm in the $i$th step. For simplicity 
we consider only the case where $g(x)$ is approximately a linear function, 
which amounts to using the Taylor expansion 

$$g(x_{i+1}) = g(x_i + \Delta x_i) = g(x_i) + \langle \nabla g(x_i), \Delta x_i \rangle + o(\Delta x_i). \qquad (7.2.2)$$

Therefore at each iteration made by the random search algorithms, we 
approximate $g(x)$ linearly on the interval $\Delta x_i$. It is proven in [32] that, for 
a rather wide class of functions optimized by random search algorithms 
under the conditions 

$$\sum_{i=1}^{\infty} \alpha_i = \infty, \qquad \sum_{i=1}^{\infty} \alpha_i^2 < \infty,$$

there exists a number $I$, sufficiently large and such that for $i > I$ a linear 
approximation of $g(x)$, that is, (7.2.2), is valid. 

Substituting (7.2.2) in any of the four random search Algorithms RS-1, 
RS-2, RS-4, and RS-5 (see, respectively, (7.1.4), (7.1.6), (7.1.11), and 
(7.1.14)), we readily obtain 

$$x_{i+1}^{(s)} = x_i + \alpha_i^{(s)} \nabla g(x_i) \cos\varphi_i^{(s)} + o(\Delta x_i^{(s)}), \qquad (7.2.3)$$

where 

$$\cos\varphi_i^{(s)} = \cdots \qquad (7.2.4)$$

and $s = 1, 2, 4, 5$ corresponds to RS-1, RS-2, RS-4, and RS-5. The 
distribution of $\varphi_i^{(s)}$ depends on the specific algorithm and on the distribution of the 
random vector $Z_i^{(s)}$. Let us assume without loss of generality that $\alpha_i^{(s)} = 1$. 
Then taking into account that for a linear function $g(x)$ the direction of 
the vector $x^* - x_i$ coincides with the direction of the gradient $\nabla g(x_i)$, we 
can express the efficiency $C_n$ (see (7.2.1)) as 

$$C_n^{(s)} = \frac{E(\cos\varphi_i^{(s)})}{E(N_i^{(s)})}. \qquad (7.2.5)$$

We consider here only the efficiencies of the random search Algorithms 
RS-1 and RS-4, assuming that the vector $Z$ is uniformly distributed on the 
surface of the unit $n$-dimensional sphere. 


(a) The Double Trial Random Search Algorithm RS-1. It follows from 
(7.1.4), (7.2.3), and (7.2.4) that 

$$\cos\varphi_i^{(1)} = \frac{\langle \nabla g(x_i), Z_i^{(1)} \rangle}{\|\nabla g(x_i)\|},$$

where $\varphi_i^{(1)}$ is a random angle between the vector $Z_i^{(1)}$ uniformly distributed 
on the $n$-dimensional sphere and the vector $\nabla g(x_i)$. We assume here that 
the direction of the gradient corresponds to $\varphi_i^{(1)} = 0$. Furthermore, it 
follows from (7.2.5) that the distribution of $\varphi_i^{(1)}$ does not depend on $i$; 
therefore the index $i$ can be omitted. We also omit for convenience index 
(1) in $\varphi^{(1)}$. It is shown in the Appendix that $\varphi$ has a p.d.f. 

$$h_n(\varphi) = B_n \sin^{n-2}\varphi, \qquad (7.2.6)$$

where 

$$B_n = \frac{\Gamma(n/2)}{\sqrt{\pi}\, \Gamma((n-1)/2)}. \qquad (7.2.7)$$

We use for convenience $-\pi/2 \le \varphi \le \pi/2$ rather than $0 \le \varphi \le \pi$ (see Appendix). 
Since for Algorithm RS-1 we need two observations of $g(x)$, at the points 
$x + \beta Z$ and $x - \beta Z$, respectively, the efficiency $C_n$ (see (7.2.5)) is 

$$C_n^{(1)} = \frac{E(\cos\varphi)}{2}. \qquad (7.2.8)$$

The expected value and the variance of $\cos\varphi$ are, respectively, 

$$E(\cos\varphi) = \int_{-\pi/2}^{\pi/2} \cos\varphi \, h_n(\varphi)\, d\varphi = 2B_n \int_0^{\pi/2} \cos\varphi \sin^{n-2}\varphi \, d\varphi = \frac{2B_n}{n-1} \qquad (7.2.9)$$

and 

$$\mathrm{var}(\cos\varphi) = E(\cos^2\varphi) - [E(\cos\varphi)]^2 = 2B_n \int_0^{\pi/2} \cos^2\varphi \sin^{n-2}\varphi \, d\varphi - \left( \frac{2B_n}{n-1} \right)^2.$$

Substituting (7.2.9) in (7.2.8), we obtain 

$$C_n = \frac{B_n}{n-1}, \qquad (7.2.10)$$

and the following relationships can also be easily verified: 

$$C_{n+1} = \frac{1}{2\pi n C_n}, \qquad C_{n+2} = \frac{n}{n+1}\, C_n. \qquad (7.2.11)$$

Table 7.2.1 and Fig. 7.2.1 represent the efficiency $C_n$ and $\mathrm{var}(\cos\varphi) = \sigma^2$ as 


Table 7.2.1 The Efficiency $C_n$ and $\sigma^2$ as Functions of $n$ 

$n$     $C_n$     $\sigma^2$   $C_n/\sigma$ 
 2     0.3184    0.5995    0.4112 
 3     0.25      0.416     0.3876 
 4     0.2125    0.314     0.3792 
 5     0.1875    0.26      0.3677 
 6     0.1702    0.221     0.3602 
 7     0.1556    0.1957    0.3518 
 8     0.1452    0.166     0.3564 
 9     0.1367    0.1401    0.3652 
10     0.1294    0.1344    0.3529 
11     0.123     0.2268    0.3538 
Fig. 7.2.1 The efficiency $C_n$ and $\sigma^2$ as functions of $n$ for Algorithm RS-1. 


a function of space size $n$, from which it follows that, as $n$ increases, both 
the efficiency and the variance decrease. When $n \to \infty$, $E(\cos\varphi) \to 0$ and 
$C_n \to 0$, that is, the random search Algorithm RS-1 becomes inefficient. 
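
The closed form $C_n = B_n/(n-1)$, with $B_n = \Gamma(n/2) / (\sqrt{\pi}\,\Gamma((n-1)/2))$, can be checked numerically. The sketch below evaluates it with `math.gamma` and compares it with a Monte Carlo estimate of $E(\cos\varphi)/2$, where $\cos\varphi$ is taken as the absolute cosine of the angle between a uniformly distributed direction and a fixed axis (the angle folded into $[-\pi/2, \pi/2]$). The function names are hypothetical.

```python
import math
import random


def b_n(n):
    """Normalizing constant B_n = Gamma(n/2) / (sqrt(pi) Gamma((n-1)/2))."""
    return math.gamma(n / 2.0) / (math.sqrt(math.pi) * math.gamma((n - 1) / 2.0))


def efficiency_rs1(n):
    """Closed form C_n = B_n / (n - 1)."""
    return b_n(n) / (n - 1)


def efficiency_rs1_mc(n, samples=50000, seed=0):
    """Monte Carlo estimate of E(cos phi) / 2 for a uniform random direction."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(c * c for c in z))
        total += abs(z[0]) / norm      # |cos| of the angle to the first axis
    return total / samples / 2.0


for n in (2, 3, 4, 5):
    print(n, round(efficiency_rs1(n), 4), round(efficiency_rs1_mc(n), 4))
```

The closed-form values for small $n$ agree with the $C_n$ column of Table 7.2.1 to about three decimals.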

(b) The Optimum Trials Random Search Algorithm RS-4. It follows from 
(7.1.11), (7.2.3), and (7.2.4) that 

$$\cos\varphi_i^{(4)} = \max(\cos\varphi_{i1}, \ldots, \cos\varphi_{iN}), \quad 0 \le \varphi_i^{(4)} \le 2\pi.$$

Since the distribution of $\varphi_i^{(4)}$ does 
not depend on the step number $i$, we can again omit the index $i$. We also 
omit for convenience index (4) in $\varphi^{(4)}$. To find the efficiency of Algorithm 
RS-4 let us find the distribution of $V = \cos\varphi$, where $\varphi$ is distributed 
(compare with (7.2.6) and (7.2.7)) as 

$$h_n(\varphi) = \frac{B_n}{2}\, |\sin^{n-2}\varphi|, \quad 0 \le \varphi \le 2\pi,$$

and 

$$B_n = \frac{\Gamma(n/2)}{\sqrt{\pi}\, \Gamma((n-1)/2)}.$$

By the transformation method (see Section 3.5.2) we obtain 

$$p_n(v) = B_n (1 - v^2)^{(n-3)/2}, \quad -1 \le v \le 1. \qquad (7.2.12)$$
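The c.d.f. and p.d.f. of $V^0 = \max(V_1, \ldots, V_N)$ are, respectively, 

$$P_n^0(v) = \left[ \int_{-1}^{v} p_n(t)\, dt \right]^N \qquad (7.2.13)$$

and 

$$p_n^0(v) = N \left[ \int_{-1}^{v} p_n(t)\, dt \right]^{N-1} p_n(v). \qquad (7.2.14)$$

The expected value and the variance of $V^0$ are, respectively, 

$$E(V^0) = \int_{-1}^{1} v\, p_n^0(v)\, dv \qquad (7.2.15)$$

and 

$$\mathrm{var}(V^0) = E[(V^0)^2] - [E(V^0)]^2. \qquad (7.2.16)$$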
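For $n = 3$ the p.d.f., the expected value, and the variance of $V^0$ are 

$$\frac{N}{2^N} (1 + v)^{N-1}, \quad -1 \le v \le 1, \qquad (7.2.17)$$

$$E(V^0) = \frac{N-1}{N+1}, \qquad (7.2.18)$$

$$\mathrm{var}(V^0) = \frac{4N}{(N+1)^2 (N+2)}. \qquad (7.2.19)$$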

It follows from (7.2.5) that the efficiency of Algorithm RS-4 is 

$$C_n = \frac{E(V^0)}{N}. \qquad (7.2.20)$$

For $n = 3$ we obtain 

$$C_3 = \frac{N-1}{(N+1)N}. \qquad (7.2.21)$$

The optimal value of $C_3$ equals $1/6$ and is achieved when $N$ is equal to 2 or 3. 

Generally, it is difficult to find $C_n$ and $\mathrm{var}(V^0)$ for $n > 3$. Table 7.2.2 and 
Fig. 7.2.2 represent simulation results for $C_n$ and $\mathrm{var}(V^0)$ as a function of $n$ 
for the optimal number of trials $N^*$ on the basis of 100 runs. It is 
interesting to note that the optimal $N^* = 2$ and does not depend on $n$. 
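
For $n = 3$ the closed form $C_3 = (N-1)/((N+1)N)$ can be checked by simulation. By (7.2.12) with $n = 3$ the cosine $V$ is uniformly distributed on $[-1, 1]$, so the sketch below estimates $E(\max(V_1, \ldots, V_N))/N$ from uniform draws. The function names are hypothetical.

```python
import random


def efficiency_rs4_n3(n_trials):
    """Closed form C_3 = (N - 1) / ((N + 1) N)."""
    return (n_trials - 1) / ((n_trials + 1) * n_trials)


def efficiency_rs4_n3_mc(n_trials, samples=50000, seed=0):
    """Estimate E(max of N cosines) / N with cosines uniform on [-1, 1]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        total += max(rng.uniform(-1.0, 1.0) for _ in range(n_trials))
    return total / samples / n_trials


for n_trials in (2, 3, 4, 5):
    print(n_trials, efficiency_rs4_n3(n_trials), efficiency_rs4_n3_mc(n_trials))
```

The closed form takes the same value, $1/6$, at $N = 2$ and $N = 3$ and decreases for larger $N$, which is the trade-off between a better direction and the cost of more observations.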

Comparing Algorithms RS-1 and RS-4 for a linear function, we con- 
clude that RS-1 is more efficient than RS-4 for all n> 1. The variance 
associated with Algorithm RS-4 for the optimal N* = 2 is always less than 
that associated with RS-1. The intuitive explanation for it can be given as 
follows. Taking two random trials according to Algorithm RS-1, we always 
Table 7.2.2 The Efficiency and the $\mathrm{var}(V^0)$ as Functions of $n$ for Algorithm RS-4 

$n$     $C_n$    $\mathrm{var}(V^0)$   $N^*$   $C_n/\sigma$ 
 3     0.198    0.236      2      0.4075 
 4     0.159    0.171      2      0.3845 
 5     0.137    0.134      2      0.3743 
 6     0.121    0.110      2      0.3647 
 7     0.109    0.053      2      0.3575 
 9     0.092    0.070      2      0.3478 
11     0.081    0.050      2      0.3622 

Note: The sample size is equal to 100. 


find a feasible random direction toward the extremum, which is generally 
not true for Algorithm RS-4. Indeed, the probability of finding such a 
direction (success) in $N$ independent trials is equal to $P(N) = 1 - (1 - p)^N$. 
Here $p$ is the probability of success in a single trial. Taking into 
account that for a linear function $p = 1/2$, we obtain, for the optimal 
$N^* = 2$, $P(N^* = 2) = 3/4$, that is, the probability of a success in Algorithm 
RS-4 is equal to $3/4$. Defining the efficiency as $C_n^{(s)}/\sigma^{(s)}$, where 
$\sigma^{(s)} = [\mathrm{var}(\cos\varphi^{(s)})]^{1/2}$, we see from Tables 7.2.1 and 7.2.2 that both Algorithms 
RS-1 and RS-4 have approximately the same efficiency. 



Fig. 7.2.2 The efficiency and the $\mathrm{var}(V^0)$ as functions of $n$ for Algorithm RS-4 (the sample 
size is equal to 100). 
7.3 LOCAL AND INTEGRAL PROPERTIES OF THE OPTIMUM TRIAL 
RANDOM SEARCH ALGORITHM RS-4 

This section is based on Ref. 35. 

7.3.1 Local Properties of the Algorithm 

The term "local properties" refers here to convergence of the vector 
$x_{i+1} - x_i$ to the direction of greatest increase of the function $g(x)$, as the 
number of trials $m$ tends to infinity. 

Assume that $g(x)$ is a continuous function and 

$$\phi(x, W) = g(x) + W, \quad x \in D \subset R^n, \qquad (7.3.1)$$

that is, each measurement of the function $g(x)$ is accompanied by additive 
noise $W$, and assume that the vector $Z$ is continuously distributed on the 
unit sphere with a density $f(Z)$. Let $B$ be the set on the surface of the unit 
sphere defined by the condition $f(Z) > 0$ and let $\bar{B}$ be the closure of $B$. Let 
us also assume that the maximum 

$$\max_{Z \in \bar{B}} g(x + \beta Z) = g(x + \beta Z^0), \quad Z^0 \in \bar{B}, \qquad (7.3.2)$$

occurs at the unique point $x + \beta Z^0$. 

We are concerned with the asymptotic behavior of the sequence of 
optimum-trial directions $\{Z_m^0\}_{m=1}^{\infty}$ defined by 

$$\phi(x + \beta Z_m^0) = \max_{1 \le k \le m} \phi(x + \beta Z_k). \qquad (7.3.3)$$

Theorem 7.3.1 Vector $Z^0$ is almost surely (a.s.) the only limiting vector of 
the sequence $\{Z_m^0\}_{m=1}^{\infty}$ if and only if the noise $W$ satisfies the following 
property: For a.s. any sequence $\{W_k\}_{k=1}^{\infty}$ of $W$'s realizations and for any 
$c > 0$, there exists a natural number $K_c$ (which depends on the sequence) 
such that 

$$W_k < \bar{W}_k + c, \quad K_c < k < \infty, \qquad (7.3.4)$$

where 

$$\bar{W}_k = \max_{1 \le j \le k-1} W_j. \qquad (7.3.5)$$

Proof (1) Sufficiency. Let us prove that for every $\delta > 0$, the 
$\delta$-neighborhood $S(Z^0, \delta)$ of the point $Z^0$ contains almost all optimum-trial 
directions $Z_m^0$, when $m$ is sufficiently large. The proof is by contradiction. 
Assume that there exists $\delta > 0$ such that the following holds: There is a 
positive probability that a realization $\{Z_m\}_{m=1}^{\infty}$ contains a subsequence 
$\{Z_{m_k}\}_{k=1}^{\infty}$ such that 

$$g(x + \beta Z_{m_k}) + W_{m_k} > g(x + \beta Z_j) + W_j, \quad 1 \le j \le m_k - 1, \qquad (7.3.6)$$

and at the same time $Z_{m_k} \notin S(Z^0, \delta)$. 

Continuity of $g(x)$ implies that we can choose $\eta > 0$ and $\delta_1 < \delta$ such that 

$$\inf_{Z \in S(Z^0, \delta_1)} g(x + \beta Z) > \sup_{Z \notin S(Z^0, \delta)} g(x + \beta Z) + 2\eta. \qquad (7.3.7)$$

(a) The case of unbounded noise. Assume the sequence $\{W_m\}_{m=1}^{\infty}$ is 
unbounded and satisfies 

$$W_m < \bar{W}_m + \eta, \quad K_\eta < m < \infty. \qquad (7.3.8)$$

Denote by $m_k'$ the number of the trial in which the maximum $\bar{W}_{m_k}$ is 
achieved, that is, $\bar{W}_{m_k} = W_{m_k'}$ and $m_k' < m_k$ hold. The sequence of indices 
$\{m_k'\}$ is a.s. unbounded, because $\{W_m\}$ is a.s. unbounded. Therefore 
the event 

$$Z_{m_{k_0}'} \in S(Z^0, \delta_1)$$

will a.s. occur for some $m_{k_0}' > K_\eta$, since at each trial there is a constant 
nonzero probability of its occurrence. Comparing the results obtained in 
trials $m_{k_0}$ and $m_{k_0}'$, it follows from (7.3.7) and (7.3.8) that 

$$g(x + \beta Z_{m_{k_0}'}) + W_{m_{k_0}'} > g(x + \beta Z_{m_{k_0}}) + W_{m_{k_0}},$$

which contradicts (7.3.6). Q.E.D. 

(b) The case of bounded noise. If $\sup W = W_{\sup} < \infty$, then the 
sequence $\{W_j\}_{j=1}^{\infty}$ a.s. contains an infinite subsequence $\{W_{m_j}\}_{j=1}^{\infty}$ such that 

$$W_{m_j} > W_{\sup} - \eta, \quad j = 1, 2, \ldots$$

On the other hand, there exists a.s. a particular subscript $m_{j_0}$ such that 
$Z_{m_{j_0}} \in S(Z^0, \delta_1)$. Thus for any $m > m_{j_0}$ satisfying $Z_m \notin S(Z^0, \delta)$, 

$$g(x + \beta Z_{m_{j_0}}) + W_{m_{j_0}} > g(x + \beta Z_m) + W_{\sup} + \eta \ge g(x + \beta Z_m) + W_m,$$

which contradicts (7.3.6). Q.E.D. 

(2) Necessity. Assume that the set $C$ of sequences $\{W_k\}_{k=1}^{\infty}$ not 
satisfying the theorem's condition has a probability $P(C) > 0$. For each sequence 
from $C$ there exists a number $c > 0$ and a subsequence $\{W_{k_j}\}_{j=1}^{\infty}$ such that 

$$W_{k_j} \ge \bar{W}_{k_j} + c. \qquad (7.3.9)$$

Our task now is to prove that with probability $P(C)$ the vector $Z^0$ is not the 
only limiting vector of the sequence $\{Z_m^0\}_{m=1}^{\infty}$. What we actually prove is a 
somewhat stronger statement: namely, that the set of limiting vectors 
contains the set 

$$V_c = \bar{B} \cap \{Z \mid g(x + \beta Z^0) - c < g(x + \beta Z) \le g(x + \beta Z^0)\}. \qquad (7.3.10)$$

To prove this statement it suffices to show that for any $y \in V_c$ and any 
$\delta > 0$ the sequence $\{Z_m^0\}_{m=1}^{\infty}$ will visit the neighborhood $S(y, \delta)$ infinitely 
often. Indeed, for any trial there exists a constant positive probability of 
entering the set $S(y, \delta) \cap V_c$. This implies that the subsequence of trials 
$\{k_j\}$ satisfying (7.3.9) a.s. contains a new subsequence $\{k_{j_l}\}$ such that 
$Z_{k_{j_l}} \in S(y, \delta) \cap V_c$ holds. The vectors $Z_{k_{j_l}}$ will be optimum-trial directions, 
since for any $j$, $1 \le j \le k_{j_l} - 1$, 

$$g(x + \beta Z_{k_{j_l}}) + W_{k_{j_l}} > g(x + \beta Z^0) + \bar{W}_{k_{j_l}} \ge g(x + \beta Z_j) + W_j.$$

Q.E.D. 

Remark In the case without noise ($W = 0$ a.s.) we can explicitly calculate 
the number of trials required to enter a prescribed $\delta$-neighborhood $S(Z^0, \delta)$ 
of the point $Z^0$ with a prescribed probability $p$. 

Define 

$$a = \left[ \int_{B} f(Z)\, dZ \right]^{-1} \int_{S(Z^0, \delta)} f(Z)\, dZ,$$

that is, $a$ is the probability of visiting $S(Z^0, \delta)$ at each single trial. The 
probability of visiting $S(Z^0, \delta)$ at least once by making $m$ trials is equal to 

$$p_m = 1 - (1 - a)^m. \qquad (7.3.11)$$

Thus if we want $p_m \ge p$, it suffices to produce 

$$m \ge \frac{\ln(1 - p)}{\ln(1 - a)} \qquad (7.3.12)$$

trials. 

In the case where $p = 1 - a$, 

$$m \ge \frac{\ln a}{\ln(1 - a)}. \qquad (7.3.13)$$

Table 7.3.1 shows some values of $m$ as a function of $a$. 
Table 7.3.1 Dependence of $m$ on $a$ 

$a$   0.500   0.200   0.100   0.050   0.020   0.010   0.005   0.002   0.001 
$m$   1       8       22      58      194     458     1057    3104    6903 
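
The bound $m \ge \ln(1 - p) / \ln(1 - a)$ is easy to evaluate. The sketch below returns the smallest integer $m$ that satisfies it, using `math.ceil`; the function name `trials_needed` is hypothetical. Since the result is always rounded up, a few values may differ by one from the entries printed in Table 7.3.1, which appear to use a slightly different rounding.

```python
import math


def trials_needed(a, p):
    """Smallest integer m with 1 - (1 - a)**m >= p."""
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - a))


for a in (0.5, 0.2, 0.1, 0.05, 0.02, 0.01):
    print(a, trials_needed(a, 1.0 - a))
```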


7.3.2 Integral Properties of the Algorithm 

The term "integral properties" refers to convergence of Algorithm RS-4 
to the point of extremum $x^*$. 


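Theorem 7.3.2 Suppose that $g(x)$ has bounded second derivatives. Let 

$$E(\|\Delta X_i\|^2 \mid x_0, x_1, \ldots, x_i) \le h_i < \infty \qquad (7.3.14)$$

for $\|x_j\| \le B < \infty$, $j = 0, 1, \ldots, i$, where $\Delta X_i = \beta_i^{-1} Y_i^0 Z_i^0$. 
Let the normalizing factor $\gamma_i$ satisfy the condition 

$$0 < \gamma_i (\tau_i \|x_i\| + h_i) < \infty, \qquad (7.3.15)$$

where 

$$\tau_i = 1 \ \text{if} \ \|V_i\| > 0 \quad \text{and} \quad \tau_i = 0 \ \text{if} \ \|V_i\| = 0$$

($V_i$ is defined in (7.3.18)), and let $\alpha_i$ and $\beta_i$ be such that 

$$\alpha_i > 0, \quad \beta_i > 0, \quad \sum_{i=1}^{\infty} \alpha_i \beta_i < \infty, \quad \sum_{i=1}^{\infty} \alpha_i^2 < \infty, \quad \sum_{i=1}^{\infty} \alpha_i = \infty; \qquad (7.3.16)$$

then the optimal trial random search algorithm 

$$x_{i+1} = \pi(x_i + \alpha_i \gamma_i \Delta X_i) \qquad (7.3.17)$$

converges a.s. to $x^*$. Here $\pi(\cdot)$ denotes the projection operator on $D$ (i.e., 
for every $x \in R^n$, $\pi(x) \in D$ and $\|x - \pi(x)\| = \min_{y \in D} \|x - y\|$). 


Proof Since $g(x)$ has bounded second derivatives, it is readily shown that 

$$V_i = E\{\Delta X_i \mid x_i\} = C_i \nabla g(x_i) + \beta_i K_i, \qquad (7.3.18)$$

where $C_i$ and the vector $K_i$ have bounded components, that is, $|C_i| < \infty$, 
$\|K_i\| < \infty$. Further, convergence of (7.3.17) to $x^*$ follows from Ref. 10, 
Theorem 1. 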
Theorem 1. 
7.4 MONTE CARLO METHOD FOR GLOBAL OPTIMIZATION 

(a) Deterministic Optimization Problem The problem of finding the 
global extremum of $g(x)$ (see (7.1.1)) has been approached in a number of 
different ways. The earliest methods were associated with the grid 
technique, in which the function was evaluated at equispaced points throughout $D$. 
We shall consider only Evtushenko's algorithm [11] in such a deterministic 
sense. Some other deterministic approaches for global optimization are 
given in Dixon [7], Shubert [37], and Strongin [39]. Evtushenko makes the 
following assumptions about the function and the objective: 

1 The function satisfies the Lipschitz condition, that is, 

$$|g(x_1) - g(x_2)| \le L \|x_1 - x_2\|$$

for any $x_1, x_2 \in D$, $L > 0$. 

2 Each $x \in D_\varepsilon$, where 

$$D_\varepsilon = \{x : |g(x) - g(x^*)| \le \varepsilon\},$$

is accepted as an approximation for $x^*$. 

Evtushenko's algorithm is as follows. 

Algorithm Gl-1 

1 Evaluate the function at $N$ equispaced points $x_1, \ldots, x_N$ throughout $D$ 
and define 

$$y_k = g(x_k), \quad k = 1, \ldots, N.$$

2 Estimate $g^*$ by 

$$M_N = \max(y_1, \ldots, y_N).$$

The theoretical background to this approach is very simple. Let $V_i$ be the 
sphere $\|x - x_i\| \le r_i$, where 

$$r_i = L^{-1} (M_N - g(x_i) + \varepsilon).$$

Then for any $x \in V_i$ 

$$g(x) \le g(x_i) + L r_i = M_N + \varepsilon.$$

Hence if the spheres $V_i$, $i = 1, \ldots, N$, cover the whole set $D$, then $M_N$ 
cannot differ from $g^*$ by more than $\varepsilon$, and the problem is solved. 

In the simplest case where $D$ is an interval, $a \le x \le b$, Evtushenko 
proposed the following procedure: 

$$x_1 = a + \frac{\varepsilon}{L}, \qquad x_{k+1} = x_k + \frac{2\varepsilon + M_k - g(x_k)}{L}, \qquad M_k = \max(g(x_1), \ldots, g(x_k)).$$
The number of function evaluations required to solve the problem is 
greatest in the case of a monotonically increasing function, for which every 
step has length $2\varepsilon/L$, so that approximately $L(b - a)/(2\varepsilon)$ evaluations 
are needed. 
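
A one-dimensional covering search of this kind can be sketched as follows. The sketch assumes the maximization form of the step, $x_{k+1} = x_k + (2\varepsilon + M_k - g(x_k))/L$ with $M_k$ the running maximum, so that points whose value lies well below the best value found so far allow longer steps. The function name, the test function, and the Lipschitz bound used for it are assumptions made for illustration.

```python
import math


def covering_search_max(g, a, b, lipschitz, eps):
    """Find a value within eps of the maximum of a Lipschitz function on [a, b]."""
    x = min(a + eps / lipschitz, b)
    g_x = g(x)
    best_x, best_val, evaluations = x, g_x, 1
    # Around x, the interval of radius (eps + best_val - g_x) / L cannot
    # contain values above best_val + eps, so it is already covered.
    while x + (eps + best_val - g_x) / lipschitz < b:
        x = min(x + (2.0 * eps + best_val - g_x) / lipschitz, b)
        g_x = g(x)
        evaluations += 1
        if g_x > best_val:
            best_x, best_val = x, g_x
    return best_x, best_val, evaluations


def multiextremal(x):
    """Hypothetical test function; |g'(x)| <= 1 + 10/3 everywhere."""
    return math.sin(x) + math.sin(10.0 * x / 3.0)


best_x, best_val, evaluations = covering_search_max(
    multiextremal, 2.7, 7.5, lipschitz=13.0 / 3.0, eps=0.01)
```

The guarantee depends on `lipschitz` being a valid bound; an underestimate can make the search step over a narrow peak.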


Most algorithms for global optimization contain random elements and are 
related to the Monte Carlo method. We consider some such algorithms. 

Brooks [4] suggested, for solving problem (7.1.1), the following "pure" 
random search algorithm. 

Algorithm Gl-2 

1 Generate $X_1, \ldots, X_N$ from any p.d.f. $f_X(x)$ such that $f_X(x) > 0$ when 
$x \in D$. 

2 Find $Y_k = g(X_k)$, $k = 1, \ldots, N$. 

3 Estimate $g^*$ by 

$$M_N = \max(Y_1, \ldots, Y_N).$$

This algorithm was also discussed in Ref. 36. Our nomenclature follows 
that reference, and our discussion is based on it. 
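
The three steps of the pure random search can be sketched as below, assuming that $D$ is a box and that the sampling density is uniform over it (the algorithm allows any density that is positive on $D$). The names `pure_random_search` and `concave_quadratic` are hypothetical.

```python
import math
import random


def pure_random_search(g, lower, upper, n_points, seed=0):
    """Return the best of n_points uniform draws from the box [lower, upper]."""
    rng = random.Random(seed)
    best_x, best_val = None, -math.inf
    for _ in range(n_points):
        x = [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]   # step 1
        y = g(x)                                                    # step 2
        if y > best_val:                                            # step 3
            best_x, best_val = x, y
    return best_x, best_val


def concave_quadratic(x):
    """Hypothetical test function with maximum value 0 at (1, -2)."""
    return -(x[0] - 1.0) ** 2 - (x[1] + 2.0) ** 2


x_few, m_few = pure_random_search(concave_quadratic, [-5.0, -5.0], [5.0, 5.0], 100)
x_many, m_many = pure_random_search(concave_quadratic, [-5.0, -5.0], [5.0, 5.0], 20000)
```

With the same seed the first 100 points of the longer run coincide with the shorter run, so `m_many` can never be smaller than `m_few`: the estimate $M_N$ is nondecreasing in $N$.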

Let $P$ be the probability measure defined on $\mathcal{B}$, the Borel $\sigma$-field of $D$, so 
that $(D, \mathcal{B}, P)$ is a probability space. 

Let $g^{-1}(a, b] = \{x \in D : a < g(x) \le b\}$, and let $F(y) = P\{Y_i \le y\}$; then 

$$F(y) = P\{g(X_i) \le y\} = P\{g^{-1}(-\infty, y]\},$$

and $Y_1, \ldots, Y_N$ are independent identically distributed (i.i.d.) random 
variables (r.v.'s) on $R^1$ with a cumulative probability distribution function 
(c.d.f.) $F$. 

Proposition 7.4.1 Suppose $P$ assigns a positive probability to every 
neighborhood of $x^*$, and suppose $g$ is continuous at $x^*$; then 

$$\lim_{N \to \infty} M_N = g^* \quad \text{a.s.} \qquad (7.4.1)$$

Proof It is clear that $F(g^*) = 1$ and for each $\delta > 0$ we have $1 - F(g^* - \delta) 
= P\{g^* - \delta < g(X_i) \le g^*\} > 0$ by our assumption. Let $A_N(\delta)$ be the event 
$\{M_N \le g^* - \delta\}$; then $P\{A_N(\delta)\} = F^N(g^* - \delta)$ and 
$\sum_{N=1}^{\infty} P\{A_N(\delta)\} = F(g^* - \delta)/(1 - F(g^* - \delta)) < \infty$. 
By the Borel-Cantelli lemma $P\{M_N \le g^* - \delta$ infinitely often$\} = 0$ for all 
$\delta > 0$ and thus (7.4.1) follows. Q.E.D. 


The choice of $P$, and consequently the resulting $F$, depends on our prior 
knowledge of $x^*$. If it is known that a certain region is more likely to 
include $x^*$, then it would be more efficient to assign a higher probability to 
that region. If nothing is known a priori about $x^*$, a uniform distribution 
over $D$ can be assumed. 

In guaranteeing (7.4.1) the exact choice of $P$ is immaterial. However, the 
rate of convergence is determined by the properties of $F$. For example, by 
a theorem of Gnedenko [14], if there exists a constant $\alpha > 0$ such that 

$$\lim_{\delta \downarrow 0} \frac{1 - F(g^* - c\delta)}{1 - F(g^* - \delta)} = c^{\alpha}, \quad \forall c > 0, \qquad (7.4.2)$$

then 

$$\lim_{N \to \infty} P\left\{ \frac{M_N - g^*}{a_N} < x \right\} = \exp\{-|x|^{\alpha}\}, \quad x < 0, \qquad (7.4.3)$$

with $a_N$ determined by $F(g^* - a_N) = (N - 1)/N$. Some more properties of 
$M_N$ are listed below. 

1 Geometric distribution Let $N_\delta$ be the first $N$ for which $M_N > g^* - \delta$. 
Then $N_\delta$ is a geometric r.v., that is, 

$$P\{N_\delta = k\} = F^{k-1}(g^* - \delta)[1 - F(g^* - \delta)], \quad k = 1, 2, \ldots \qquad (7.4.4)$$

Consequently, it is well known that 

$$E N_\delta = \frac{1}{1 - F(g^* - \delta)} = \eta_\delta$$

and 

$$P\{N_\delta \le k\} = 1 - F^k(g^* - \delta) \equiv P_{\delta,k}.$$

It is clear that $\eta_\delta \to \infty$ as $\delta \to 0$ and thus 
$P_{\delta,[\eta_\delta]} = 1 - (1 - 1/\eta_\delta)^{[\eta_\delta]} \approx 1 - e^{-1} = 0.63$ 
(here $[\eta]$ is the integer part of $\eta$). Hence $\eta_\delta = E N_\delta$ is 
approximately a 63% confidence bound for the number of trials necessary to 
make $M_N > g^* - \delta$ ($\delta > 0$ small). Let $a = 1 - F(g^* - \delta)$; then 
$P_{\delta,k} = 1 - (1 - a)^k$. For every given pair $(a, \beta)$ the smallest $k$ for which 
$P_{\delta,k} \ge \beta$ is $k(a, \beta) = \ln(1 - \beta)/\ln(1 - a)$, and Table 7.3.1 with 
$k(a, 1 - a) = m$ can be used again. 

2 Lack of memory It is well known that (7.4.4) implies 

$$P\{N_\delta > k + m \mid N_\delta > m\} = P\{N_\delta > k\}. \qquad (7.4.5)$$

In terms of $M_N$ we thus have 

$$P\{M_{k+m} \le g^* - \delta \mid M_m \le g^* - \delta\} = P\{M_k \le g^* - \delta\},$$

because the events $\{N_\delta > k\}$ and $\{M_k \le g^* - \delta\}$ are identical. It follows 
that, given $m$ successive failures (to enter $\{y : y > g^* - \delta\}$), the conditional 
distribution of the number of trials necessary for the first success equals its 
unconditional distribution. In particular we have 

$$E\{N_\delta \mid M_m \le g^* - \delta\} = m + E N_\delta. \qquad (7.4.6)$$
3 Poisson approximation  If (7.4.2) or (7.4.3) holds, then $Z_{\delta,N}$, the
number of $Y_i$, $i = 1, 2, \ldots, N$, for which $Y_i > g^* - \delta$, is asymptotically
Poisson distributed. More precisely, for fixed N and $\delta > 0$, $Z_{\delta,N}$ is a
binomial r.v. with parameters N and $p = 1 - F(g^* - \delta)$. When (7.4.2) holds,
by substituting $\delta = a_N$ in (7.4.2), we obtain $N[1 - F(g^* - c a_N)] \to c^a$, which
implies that $Z_{c a_N, N}$ converges in distribution to a Poisson r.v. with
parameter $c^a$.

The problem of finding the global maximum of g(x) can be reduced to
that of finding the mode of a density function associated with g(x).
Indeed, if $g(x) > 0$, $x \in D$, then $\psi(x) = c^{-1} g(x)$, where $c^{-1} = \left(\int_D g(x)\,dx\right)^{-1}$,
is a density function, and the problems of finding the global maximum of
g(x) and finding the mode of $\psi(x)$ are equivalent. This can be solved by
one of the methods mentioned in Refs. 41, 42, and 46.

If g(x) is unrestricted in sign but bounded, that is, if $|g(x)| < k$, then
$\mu(x) = c^{-1}(g(x) + k)$, where $c^{-1} = \left[\int_D (g(x) + k)\,dx\right]^{-1}$, is again a density
function.

A natural extension of the “pure” random search algorithm Gl-2 is the
so-called multistart algorithm [7], which is probably the one most
frequently used in practice for global optimization. In this approach we use
any iterative procedure (gradient, random search, etc.) for local
optimization and run it from a number of different starting points $x_{0j}$,
$j = 1, \ldots, N$. The set of all terminating points hopefully includes the global
maximum x*.

The multistart algorithm is as follows. 

Algorithm Gl-3

1 Generate $X_{01}, \ldots, X_{0N}$ from any p.d.f. $f_X(x) > 0$, $x \in D$ (usually $X_0$ is
chosen to be uniformly distributed over D).

2 Consider $X_{01}, \ldots, X_{0N}$ as the starting points, then apply N times a
local optimization algorithm (gradient, random search, etc.) and find the
local extrema $x_1^*, \ldots, x_N^*$ of g(x) associated with $X_{01}, \ldots, X_{0N}$.

3 Estimate x* by

$$\max(x_1^*, \ldots, x_N^*).$$
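The three steps above can be sketched in a few lines. This is a minimal one-dimensional illustration, not the book's implementation: the local optimizer `local_climb` is a hypothetical pattern search chosen only for brevity, step 3 is read as "keep the terminating point with the largest value of g", and the test function `demo_g` is an assumption.

```python
import random


def local_climb(g, x, lo, hi, step=0.1, shrink=0.5, tol=1e-6):
    """Hypothetical local optimizer: move uphill, halving the step when stuck."""
    while step > tol:
        moved = False
        for cand in (x - step, x + step):
            if lo <= cand <= hi and g(cand) > g(x):
                x, moved = cand, True
                break
        if not moved:
            step *= shrink
    return x


def multistart(g, lo, hi, n_starts, rng):
    starts = [rng.uniform(lo, hi) for _ in range(n_starts)]      # step 1
    local_maxima = [local_climb(g, x0, lo, hi) for x0 in starts]  # step 2
    return max(local_maxima, key=g)                               # step 3


def demo_g(x):
    """Two local maxima near x = -0.96 and x = 1.04; the second is global."""
    return -(x * x - 1.0) ** 2 + 0.3 * x


if __name__ == "__main__":
    print(multistart(demo_g, -2.0, 2.0, 20, random.Random(0)))
```

Any local method could replace `local_climb`; the only property the sketch relies on is that each start converges to the local maximum of its own region of attraction.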

Let us define $D_j$ as the set of starting points $X_{0j}$ from which the algorithm
will converge to the j-th local maximum. We call $D_j$ the region of attraction of
the j-th local maximum. Let us assume that the number of local maxima
is finite, and let $X_0$ be uniformly distributed over D; then the probability
of at least one $X_{0j}$, from a sequence of N points drawn at random over
D, falling in the region of attraction $D^*$ of the global maximum equals

$$p = 1 - \left[ 1 - \frac{m(D^*)}{m(D)} \right]^N, \tag{7.4.7}$$

where m(D) is the measure of D.
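As a worked illustration of (7.4.7), with numbers chosen here purely for the example: if the region of attraction of the global maximum occupies 10% of D and N = 22 starting points are drawn, then

```latex
p = 1 - \left[ 1 - \frac{m(D^*)}{m(D)} \right]^{N}
  = 1 - (1 - 0.1)^{22}
  = 1 - 0.9^{22}
  \approx 1 - 0.0985
  \approx 0.90 .
```

So roughly 22 independent uniform starts are needed before the chance of landing at least once in a region of relative measure 0.1 reaches 90%.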


A more sophisticated approach to the global optimization problem was
suggested by Chichinadze [5], who introduced a probability function P(v)
as the probability of $g(x) < v$; that is, if m(V) is the measure of the level
set

$$V = \{x : g(x) < v\},$$

then

$$P(v) = \frac{m(V)}{m(D)}. \tag{7.4.8}$$

The function P(v) is, of course, not available, but if we calculate g(x) at N
points distributed at random over D, and count the number M of these
points for which $g(x) < v$, then M/N approximates P(v). It is not difficult
to see that the global maximum corresponds to $P(v) = 1$ and the global
minimum to $P(v) = 0$. To find the solution of $P(v) = 1$, Chichinadze
suggested approximating P(v) by a linear combination of a set of given
polynomial functions $P_l(v)$, $l = 1, \ldots, k$,

$$P(v) = \sum_{l=1}^{k} \lambda_l P_l(v). \tag{7.4.9}$$

The range of v was divided at the points $v_j$, $j = 1, 2, \ldots$, and the optimal
values of $\lambda_l$ were determined by minimizing

$$\sum_{j} w_j \left( \frac{M_j}{N} - \sum_{l=1}^{k} \lambda_l P_l(v_j) \right)^2, \tag{7.4.10}$$

where $M_j$ is the number of points for which $g(x) < v_j$, and $w_j > 0$,
$j = 1, 2, \ldots$. The root v* of $P(v) = 1$ was then determined to obtain an
estimate of the global maximum of g(x).
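The empirical approximation M/N of P(v) in (7.4.8) can be sketched as follows. This covers only the counting step, not Chichinadze's polynomial fit; the function name and the example (g(x) = x² on D = [0, 1], for which P(v) = √v) are assumptions made for illustration.

```python
import random


def empirical_level_probability(g, sample_point, v, n_points, rng):
    """Estimate P(v) = m({x : g(x) < v}) / m(D) by the fraction M / N.

    `sample_point(rng)` must return a point uniformly distributed over D.
    """
    hits = sum(1 for _ in range(n_points) if g(sample_point(rng)) < v)
    return hits / n_points


if __name__ == "__main__":
    rng = random.Random(1)
    # For g(x) = x**2 on [0, 1] the exact value is P(0.25) = 0.5.
    print(empirical_level_probability(lambda x: x * x,
                                      lambda r: r.random(),
                                      0.25, 20000, rng))
```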

Considerable attention has been paid in multiextremal optimization
to random search algorithms. Gaviano [12] showed that if

$$x_{i+1} = x_i + \alpha_i Z_i \tag{7.4.11}$$

and

$$\alpha_i = \arg\left\{ \text{global} \max_{\alpha} g(x_i + \alpha Z_i) \right\}, \tag{7.4.12}$$

then

$$\lim_{i \to \infty} P\{ g(x_i) - g(x^*) < \varepsilon \} = 1 \tag{7.4.13}$$

for every $\varepsilon > 0$. Here Z is a vector uniformly distributed on the surface of a
unit n-dimensional sphere.

If D is a finite space and if a bound on the first derivative of g(x) is
known, then Evtushenko’s [11] or Shubert’s [37] one-dimensional global
optimization techniques could be used to find the optimal $\alpha_i$. However, for
a general function, a global optimization along the lines of (7.4.12) is
difficult to perform.

Matyas [22] proved the convergence to x* of the following random
search algorithm.

Algorithm Gl-4

1 Generate $Y_1, Y_2, \ldots$ from an n-dimensional normal distribution
with zero mean and covariance matrix $\Sigma$, that is, $Y \sim N(0, \Sigma)$.

2 Select an initial point $x_1 \in D$.

3 Compute $g(x_1)$.

4 $i \leftarrow 1$.

5 If $x_i + Y_i \in D$, go to step 8.

6 $x_{i+1} \leftarrow x_i$.

7 Go to step 10.

8 Compute $g(x_i + Y_i)$.

9 $x_{i+1} = x_i + Y_i$ if $g(x_i + Y_i) > g(x_i) - \varepsilon$, where $\varepsilon > 0$;
$x_{i+1} = x_i$ otherwise.

10 $i \leftarrow i + 1$.

11 Go to step 5.

According to this algorithm, a step is made from the point $x_i$ in the
direction $Y_i$ only if $x_i + Y_i \in D$ and $g(x_i + Y_i) > g(x_i) - \varepsilon$.
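The steps of this random search can be sketched as follows. The sketch takes the acceptance rule literally as printed (accept when g(x + Y) > g(x) - ε), uses a spherical covariance σ²I for simplicity, and additionally records the best point visited; the last two choices, the iteration cap, and all names are assumptions, not part of the source algorithm.

```python
import random


def matyas_search(g, in_domain, x1, sigma, eps, n_iter, rng):
    """Random search with normal steps; returns the best point visited."""
    x = list(x1)                                   # step 2
    gx = g(x)                                      # step 3
    best_x, best_g = list(x), gx
    for _ in range(n_iter):                        # steps 4, 10, 11
        y = [rng.gauss(0.0, sigma) for _ in x]     # step 1, Sigma = sigma^2 * I
        cand = [xi + yi for xi, yi in zip(x, y)]
        if not in_domain(cand):                    # steps 5-7: stay at x_i
            continue
        g_cand = g(cand)                           # step 8
        if g_cand > gx - eps:                      # step 9 as printed
            x, gx = cand, g_cand
            if gx > best_g:
                best_x, best_g = list(x), gx
    return best_x, best_g


if __name__ == "__main__":
    g = lambda p: -(p[0] ** 2 + p[1] ** 2)         # maximum 0 at the origin
    box = lambda p: all(-2.0 <= c <= 2.0 for c in p)
    print(matyas_search(g, box, [1.5, -1.5], 0.3, 1e-6, 3000, random.Random(2)))
```

Because steps that leave D are discarded rather than projected back, the iterate never leaves the domain.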


The following procedure, based on cluster analysis, was introduced into
global optimization by Becker and Lago [3].

Algorithm Gl-5

1 Select N points uniformly distributed in D.

2 Take $N_1 < N$ of these points with the greatest function values.

3 Apply a cluster analysis to these $N_1$ points, grouping them into
discrete clusters; then find the boundaries of each cluster and define a new
domain $D_1 \subset D$, which hopefully contains the global maximum.

4 Replace D by $D_1$ and perform steps 1 through 3 several times.

This is a heuristic algorithm and its ability to find the global maximum
depends on the cluster analysis technique used in step 3 and on the
parameters N and $N_1$. There exists a positive probability of missing the
global maximum. However, in practice this technique is widely used for
global optimization. More on cluster analysis for global optimization can
be found in Gomulka [15], Price [27], and Törn [40].

(b) Stochastic Optimization Problem  Consider the stochastic
optimization problem (7.1.16), assuming that

$$g(x, W) = g(x) + W, \tag{7.4.14}$$

which means that g(x) is measured with some error W. The following
Monte Carlo algorithm, which is similar to Algorithm Gl-2, can be used
for estimating g* in (7.1.16).

Algorithm Gl-2'

1 Generate $X_1, \ldots, X_N$ from any probability density function
(p.d.f.) $f_X(x)$ ($f_X(x) > 0$, $x \in D$).

2 Find $Y_k = g(X_k, W_k) = g(X_k) + W_k$, $k = 1, \ldots, N$.

3 Estimate g* by

$$M_N = \max(Y_1, \ldots, Y_N).$$

Let $W_k$ be i.i.d. r.v.’s with a given c.d.f. H. We also assume that the $W_k$
and the $X_k$ are independent and that $W_+ = \inf\{w : H(w) = 1\} \le \infty$. The
following proposition is proven in Ref. 36.

Proposition 7.4.2. Under the conditions of Proposition 7.4.1

$$\lim_{N \to \infty} M_N = g^* + W_+ \quad \text{a.s.} \tag{7.4.15}$$

Proof: Let

$$E_N = \max_{1 \le i \le N} W_i.$$

We say that $\{E_N\}$ is stable if there exists a sequence of constants $\{\eta_N\}$
such that for all $\delta > 0$

$$\lim_{N \to \infty} P\{ |E_N - \eta_N| > \delta \} = 0. \tag{7.4.16}$$

We consider three cases.

1 $W_+ < \infty$, in which case our estimate for g* is $M_N - W_+$, and we
certainly have

$$\lim_{N \to \infty} (M_N - W_+) = g^* \quad \text{a.s.} \tag{7.4.17}$$

2 $W_+ = \infty$, but $\{E_N\}$ is stable, in which case (7.4.16) implies

$$\lim_{N \to \infty} (M_N - \eta_N) = g^* \quad \text{in probability}, \tag{7.4.18}$$

and $\eta_N$ is determined by $H(\eta_N) = (N - 1)/N$. A necessary and sufficient
condition for case 2 is [14]

$$\lim_{w \to \infty} \frac{1 - H(w + \delta)}{1 - H(w)} = 0, \qquad \forall \delta > 0. \tag{7.4.19}$$

We thus see that, if $W_+$ and $\eta_N$ are known, we still have convergent
algorithms in (7.4.17) and (7.4.18).

3 $W_+ = \infty$, but $\{E_N\}$ is not stable. Here we have by (7.4.15) $M_N \to \infty$
a.s. Q.E.D.
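Case 1 of the proof can be illustrated numerically. The following minimal sketch assumes additive noise uniform on [-0.5, 0.5], so that W+ = 0.5 is known, and corrects the maximum of the noisy observations as in (7.4.17); the objective, the noise law, and the function names are assumptions made for the example.

```python
import random


def noisy_random_search(g, sample_x, sample_w, w_plus, n, rng):
    """Estimate g* by M_N - W_+, where M_N = max_k [g(X_k) + W_k]."""
    m_n = max(g(sample_x(rng)) + sample_w(rng) for _ in range(n))
    return m_n - w_plus


if __name__ == "__main__":
    rng = random.Random(7)
    g = lambda x: 1.0 - (x - 0.3) ** 2            # g* = 1 at x = 0.3
    estimate = noisy_random_search(g,
                                   lambda r: r.random(),           # X ~ U(0, 1)
                                   lambda r: r.uniform(-0.5, 0.5),  # W, W_+ = 0.5
                                   0.5, 20000, rng)
    print(estimate)
```

Without the correction term the raw maximum would converge to g* + W+ = 1.5, as (7.4.15) states.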


The following examples will demonstrate these ideas.

1 If the $W_i$ are normally distributed with mean 0 and variance $\sigma^2$, then
(7.4.19) holds and $\{E_N\}$ is stable with $\eta_N = \sigma (2 \log N)^{1/2}$.

2 Suppose that the $W_i$’s have the generalized double exponential
distribution, that is,

$$H(x) = \begin{cases} \tfrac{1}{2} \exp(-|x|^a), & x < 0, \\ 1 - \tfrac{1}{2} \exp(-x^a), & x \ge 0. \end{cases}$$

Then by (7.4.19) $\{E_N\}$ is not stable for $a \le 1$, but is stable for $a > 1$ with

$$\eta_N = (\log(N/2))^{1/a}.$$

Algorithm Gl-3 can also be adapted for the stochastic optimization
problem (7.1.16), rewriting step 2 as follows:

2 Consider $X_{01}, \ldots, X_{0N}$ as the starting points; then apply N times a
local iterative procedure (stochastic approximation, random search, etc.)
that is able to find the associated local extrema $x_1^*, \ldots, x_N^*$ of
$E[g(x, W)] = g(x)$.

(c) Constrained Optimization  Consider the following constrained
optimization problem:

$$\max_{x \in D \subset R^n} g_0(x), \tag{7.4.20}$$

subject to

$$g_k(x) \le 0, \qquad k = 1, \ldots, m. \tag{7.4.21}$$

We assume that the convex programming methods (see Avriel [2])
cannot be applied because the convexity assumptions do not hold either
for the region $D = \{x : g_k(x) \le 0,\ k = 1, \ldots, m\}$ or for the function $g_0(x)$.
Let us consider two cases.

1 If the region $D = \{x : g_k(x) \le 0,\ k = 1, \ldots, m\}$ is known, and we can
readily generate r.v.’s in D, then Algorithms Gl-2 through Gl-5 can be
directly applied for finding the global extremum of (7.4.20) and (7.4.21).

2 If the region $D = \{x : g_k(x) \le 0,\ k = 1, \ldots, m\}$ is either unknown
explicitly or is complex, but another region $D_1$ that contains D and has a
simple shape is known, then we generate r.v.’s in $D_1$ and accept or reject
them according to whether $X \in D$ or $X \in (D_1 - D)$. Next we can again
apply Algorithms Gl-2 through Gl-5.
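Case 2 can be sketched with an acceptance-rejection sampler followed by pure random search. In this minimal illustration the enclosing region D_1 is taken to be an axis-aligned box, which is an assumption; the function names and the retry limit are likewise hypothetical.

```python
import random


def sample_feasible(constraints, box, rng, max_tries=100000):
    """Draw uniformly from the box D_1 and accept the first point lying in D."""
    for _ in range(max_tries):
        x = [rng.uniform(lo, hi) for lo, hi in box]
        if all(gk(x) <= 0.0 for gk in constraints):   # X in D: accept
            return x
    raise RuntimeError("no feasible point found in max_tries draws")


def constrained_random_search(g0, constraints, box, n, rng):
    """Pure random search for max g0 over D using n accepted points."""
    points = (sample_feasible(constraints, box, rng) for _ in range(n))
    return max(points, key=g0)


if __name__ == "__main__":
    # D is the unit disk, D_1 the square [-1, 1]^2, and g0(x) = x_1 + x_2.
    disk = [lambda p: p[0] ** 2 + p[1] ** 2 - 1.0]
    best = constrained_random_search(lambda p: p[0] + p[1], disk,
                                     [(-1.0, 1.0), (-1.0, 1.0)],
                                     5000, random.Random(11))
    print(best)
```

The accepted points are uniformly distributed over D, so any of the earlier algorithms could consume them in place of the simple maximum used here.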


7.5 A CLOSED FORM SOLUTION FOR GLOBAL OPTIMIZATION 

This section is based on the results of Meerkov [23] and Pincus [25]. Both 
papers deal with the multiextremal optimization and use the classical 
Laplace formula for certain integrals. We follow Pincus [25]. 

Consider the optimization problem

$$\min_{x \in D \subset R^n} g(x) = g(x^*) = g^*,$$

where g(x) is a continuous function, D is a closed bounded domain, and
x* is the unique optimum point. Pincus [25] proved the following theorem.

Theorem 7.5.1. Let $g(x) = g(x_1, \ldots, x_n)$ be a real-valued continuous
function over a closed bounded domain $D \subset R^n$. Further, assume there is a
unique point $x^* \in D$ at which $\min_{x \in D} g(x)$ is attained (there are no
restrictions on relative minima). Then the coordinates $x_i^*$ of the
minimization point are given by

$$x_i^* = \lim_{\lambda \to \infty} \frac{\int_D x_i \exp(-\lambda g(x))\,dx}{\int_D \exp(-\lambda g(x))\,dx}, \qquad i = 1, \ldots, n. \tag{7.5.1}$$

In particular the theorem is valid when D is convex and the objective
function g is strictly convex. The proof of the theorem is based on the
Laplace formula, which for sufficiently large $\lambda$ can be written as

$$\int_D x_i \exp(-\lambda g(x))\,dx \approx x_i^* \exp(-\lambda g(x^*)), \tag{7.5.2}$$

$$\int_D \exp(-\lambda g(x))\,dx \approx \exp(-\lambda g(x^*)). \tag{7.5.3}$$

We now outline a Monte Carlo method based on the work of Metropolis et al.
[24] (see also [26]) for evaluating the coordinates of the minimization point
$x^* = (x_1^*, \ldots, x_n^*)$, that is, for approximating the ratio appearing on the
right-hand side of (7.5.1). For fixed $\lambda$ the ratio in (7.5.1) is

$$\frac{\int_D x_i \exp(-\lambda g(x))\,dx}{\int_D \exp(-\lambda g(x))\,dx}. \tag{7.5.4}$$

For large $\lambda$ the major contribution to the integrals appearing in (7.5.1)
comes from a small neighborhood of the minimizing point x*. Metropolis’
sampling procedure [24], described below, is based on simulating a Markov
chain that spends, in the long run, most of the time visiting states near the
minimizing point and is more efficient than a direct Monte Carlo method, which
estimates the numerator and the denominator separately.

The idea of the method is to generate samples with density

$$f_\lambda(x) = \frac{\exp(-\lambda g(x))}{\int_D \exp(-\lambda g(x))\,dx}, \qquad x \in D, \tag{7.5.5}$$

where the denominator of (7.5.5) is not known. This is done as follows.

Partition the region D into a finite number N of mutually disjoint
subregions $D_j$ and replace integrals over D by corresponding Riemann
sums using the partition $\{D_j\}$. Fix a point $y^j = (y_1^j, \ldots, y_n^j) \in D_j$. Then
construct an irreducible ergodic Markov chain $\{X_k\}$ with state space
$\{y^1, \ldots, y^N\}$ and with transition probabilities $p_{ij}$, $1 \le i, j \le N$, satisfying
$\sum_{i=1}^{N} \pi_i p_{ij} = \pi_j$, where
$\pi_j = \exp[-\lambda g(y^j)] / \sum_{l=1}^{N} \exp[-\lambda g(y^l)]$; that is, $(\pi_j)$ is the invariant
distribution for the Markov chain. It should be noted that, in the last expression
for $\pi_j$, we have assumed for simplicity that all subregions $D_j$ have equal
volumes. Then using the strong law of large numbers for Markov chains,
we have with probability 1

$$\frac{1}{m} \sum_{k=1}^{m} X_k \xrightarrow[m \to \infty]{} \frac{\sum_{j=1}^{N} y^j \exp[-\lambda g(y^j)]}{\sum_{j=1}^{N} \exp[-\lambda g(y^j)]} \approx \frac{\left( \int_D x_1 \exp(-\lambda g(x))\,dx, \ldots, \int_D x_n \exp(-\lambda g(x))\,dx \right)}{\int_D \exp(-\lambda g(x))\,dx}. \tag{7.5.6}$$





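The sampling error for each component $X_k^i$ of the vector $X_k$ is (see [26])

$$E\left[ \left( \frac{1}{m} \sum_{k=1}^{m} X_k^i - \mu_i \right)^2 \right] \le \frac{c}{m},$$

where c is a positive number and

$$\mu_i = \frac{\sum_{j=1}^{N} y_i^j \exp[-\lambda g(y^j)]}{\sum_{j=1}^{N} \exp[-\lambda g(y^j)]}.$$

From Chebyshev’s inequality we have

$$P\left\{ \left| \frac{1}{m} \sum_{k=1}^{m} X_k^i - \mu_i \right| > \varepsilon \right\} \le \frac{c}{\varepsilon^2 m}.$$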


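We now turn to the question of how Metropolis constructs a Markov
chain with the required invariant distribution. He starts with a symmetric
transition probability matrix $P^* = (p_{ij}^*)$, $1 \le i, j \le N$, that is, $p_{ij}^* = p_{ji}^*$,
$p_{ij}^* \ge 0$, $\sum_{j=1}^{N} p_{ij}^* = 1$, and the known ratios $\pi_j / \pi_i$, and defines the transition
matrix of the Markov chain $\{X_k\}$ as follows:

$$p_{ij} = \begin{cases} p_{ij}^* \, \pi_j / \pi_i, & \text{if } \pi_j / \pi_i < 1,\ i \ne j, \\ p_{ij}^*, & \text{if } \pi_j / \pi_i \ge 1,\ i \ne j, \\ p_{ii}^* + \sum_{j:\, \pi_j < \pi_i} p_{ij}^* \left( 1 - \pi_j / \pi_i \right), & i = j. \end{cases} \tag{7.5.7}$$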


It is shown in Ref. 16 that a Markov chain with the above transition
matrix has the invariant distribution $\{\pi_i\}$, that is, $\pi_j = \sum_i \pi_i p_{ij}$. A chain with
such a transition matrix can be realized as follows. Given that the chain is
in state $y^i$ at time k, that is, $X_k = y^i$, the state at time k + 1 is determined
by choosing a new state according to the distribution $\{p_{ij}^*,\ j = 1, \ldots, N\}$. If
the state chosen is $y^j$, we calculate the ratio $\pi_j / \pi_i$. If $\pi_j / \pi_i \ge 1$, we accept
$y^j$ as the new state at time k + 1; if $\pi_j / \pi_i < 1$, we take $y^j$ as the state of the
Markov chain at time k + 1 with probability $\pi_j / \pi_i$ and $y^i$ as the new state
at time k + 1 with probability $1 - \pi_j / \pi_i$. It is also shown in Ref. 16 that
this procedure leads to a Markov chain with transition matrix $P = (p_{ij})$.
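The realization just described can be sketched on a one-dimensional grid. The sketch assumes the simplest symmetric proposal, p*_ij = 1/N for all i and j, and compares the chain average with the exact discrete ratio in (7.5.6); the objective `demo_g`, the grid, the value of λ, and the function names are assumptions chosen for illustration.

```python
import math
import random


def metropolis_mean(g, grid, lam, n_steps, rng):
    """Average of a Metropolis chain whose invariant law is pi_j ~ exp(-lam * g(y_j))."""
    i = rng.randrange(len(grid))
    total = 0.0
    for _ in range(n_steps):
        j = rng.randrange(len(grid))                 # symmetric proposal p*_ij = 1/N
        ratio = math.exp(-lam * (g(grid[j]) - g(grid[i])))   # pi_j / pi_i
        if ratio >= 1.0 or rng.random() < ratio:     # accept with prob min(1, ratio)
            i = j
        total += grid[i]
    return total / n_steps


def exact_mean(g, grid, lam):
    """Discrete ratio sum_j y_j exp(-lam g(y_j)) / sum_j exp(-lam g(y_j))."""
    weights = [math.exp(-lam * g(y)) for y in grid]
    return sum(y * w for y, w in zip(grid, weights)) / sum(weights)


def demo_g(x):
    """Two local minima; the global one is near x = 1.04."""
    return (x * x - 1.0) ** 2 - 0.3 * x


if __name__ == "__main__":
    grid = [-2.0 + 4.0 * k / 400 for k in range(401)]
    print(metropolis_mean(demo_g, grid, 10.0, 20000, random.Random(3)),
          exact_mean(demo_g, grid, 10.0))
```

Only the ratios π_j/π_i enter the chain, so the unknown normalizing sum is never computed during sampling. A local (nearest-neighbour) proposal would also be symmetric, but would cross the barrier between the two minima far more slowly.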

It should be noted that (7.5.1) can be useful not only for finding the 
global optimum in a multiextremal problem, but also for solving nonlinear 
equations (see [20]) and some kinds of problems in statistical mechanics as 
well (see [16]).

7.6 OPTIMIZATION BY SMOOTHED FUNCTIONALS

Consider the following stochastic optimization problem (see (7.1.16)):

$$\min_{x \in D \subset R^n} E_W[\phi(x, W)] = \min_{x \in D \subset R^n} g(x) = g(x^*), \tag{7.1.16'}$$

where $\phi(x, W)$ is a stochastic function with unknown p.d.f. $p(w)$, D is a
convex bounded domain, and x* is the unique optimal point. We also
assume that g(x) is bounded for each $x \in D$ and $\operatorname{var}_W\{\phi(x, W)\} < \infty$. For
solving this problem let us introduce the following convolution function:

$$\hat{g}(x, \beta) = \int_{-\infty}^{\infty} h(v, \beta)\, g(x - v)\,dv = \int_{-\infty}^{\infty} h((x - v), \beta)\, g(v)\,dv, \tag{7.6.1}$$

which is called a smoothed functional [18].

In order for $\hat{g}(x, \beta)$ to have nice smoothing properties, let us make some
assumptions about the kernel $h(v, \beta)$.

1 $h(v, \beta) = (1/\beta^n)\, h(v/\beta) = (1/\beta^n)\, h(v_1/\beta, \ldots, v_n/\beta)$ is a piecewise
differentiable function with respect to v.

2 $\lim_{\beta \to 0} h(v, \beta) = \delta(v)$, where $\delta(v)$ is Dirac’s delta function.

3 $\lim_{\beta \to 0} \hat{g}(x, \beta) = g(x)$, if x is a point of continuity of g(x).

4 $h(v, \beta)$ is a p.d.f., that is, $\hat{g}(x, \beta) = E_V[g(x - V)]$.

We assume that the original function g(x) is not “well behaved.” For
instance, it can be a multiextremal function or have a fluctuating character
(see Fig. 7.6.1).

[Fig. 7.6.1 A badly “behaved” function.]

We expect “better behavior” from the smoothed function $\hat{g}(x, \beta)$ than
from the original one.

The idea of smoothed functionals is as follows: for a given function g(x)
construct a smoothed function $\hat{g}(x, \beta)$ and, operating only with $\hat{g}(x, \beta)$,
find the extremum of g(x). In other words, while operating only with
$\hat{g}(x, \beta)$, we want to avoid all fluctuations and local extrema of g(x) and
find x*.

It is obvious that the effect of smoothing depends on the parameter $\beta$:
for large $\beta$ the effect of smoothing is large, and vice versa. When $\beta \to 0$ it
follows from condition 2 that $\hat{g}(x, \beta) \to g(x)$ and that there is no
smoothing.

It is intuitively clear that, to avoid fluctuations and local extrema, $\beta$ has
to be sufficiently large at the start of the optimization. However, on
approaching the optimum we can reduce the effect of smoothing by letting
$\beta$ vanish, since at the extremum point x* we want coincidence of both
extrema, g(x) and $\hat{g}(x, \beta)$. Accordingly, we speak of a set of smoothed
functions $\hat{g}(x, \beta_s)$, $s = 1, 2, \ldots$, while constructing an iterative procedure
for finding x*.

Before describing the iterative procedure for solving the problem (7.1.16'),
we derive some attractive properties of $\hat{g}(x, \beta)$.

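Property 1  If g(x) is convex, then $\hat{g}(x, \beta)$ is also convex.

The proof of this property is straightforward. For $0 < \lambda < 1$

$$\lambda \hat{g}(x, \beta) + (1 - \lambda) \hat{g}(y, \beta) - \hat{g}(\lambda x + (1 - \lambda) y, \beta)
= \int h(v, \beta) \left[ \lambda g(x - v) + (1 - \lambda) g(y - v) - g(\lambda x + (1 - \lambda) y - v) \right] dv. \tag{7.6.2}$$

The convexity of g(x) implies

$$g(\lambda x + (1 - \lambda) y - v) = g(\lambda (x - v) + (1 - \lambda)(y - v)) \le \lambda g(x - v) + (1 - \lambda) g(y - v). \tag{7.6.3}$$

Substituting (7.6.3) in (7.6.2) and taking into account that $h(v, \beta) \ge 0$,
we find that the left-hand side of (7.6.2) is nonnegative, which proves the
property.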

Property 2  It is readily seen that the gradient of the smoothed function
$\hat{g}(x, \beta)$ may be expressed as

$$\hat{g}_x(x, \beta) = \int_{-\infty}^{\infty} h_x((x - v), \beta)\, g(v)\,dv = \int_{-\infty}^{\infty} h_v(v, \beta)\, g(x - v)\,dv \tag{7.6.4}$$

and is called a smoothed gradient. Using the right-hand side of (7.6.4), together
with condition 1, we obtain

$$\hat{g}_x(x, \beta) = \frac{1}{\beta} \int_{-\infty}^{\infty} h_v(v)\, g(x - \beta v)\,dv, \tag{7.6.5}$$

where

$$h_v(v) = \left( \frac{\partial h(v)}{\partial v_1}, \ldots, \frac{\partial h(v)}{\partial v_n} \right) \tag{7.6.6}$$

is the gradient of h(v) and $\partial h(v)/\partial v_k$, $k = 1, \ldots, n$, are the partial
derivatives.


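It is important to note that, to find a gradient of the smoothed function
$\hat{g}(x, \beta)$, we do not need to know the gradient of g(x), which sometimes
does not exist at all.

We consider also the following smoothed function:

$$\hat{g}(x, \beta) = \frac{1}{2} \int_{-\infty}^{\infty} h(v, \beta) \left[ g(x + v) + g(x - v) \right] dv. \tag{7.6.7}$$

By analogy with (7.6.4) and (7.6.5) we can obtain the smoothed gradient
for $\hat{g}(x, \beta)$:

$$\hat{g}_x(x, \beta) = \frac{1}{2} \int_{-\infty}^{\infty} h_v(v, \beta) \left[ g(x - v) - g(x + v) \right] dv
= \frac{1}{2\beta} \int_{-\infty}^{\infty} h_v(v) \left[ g(x - \beta v) - g(x + \beta v) \right] dv. \tag{7.6.8}$$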

Now we give two examples of kernels which satisfy conditions 1
through 4, and find their smoothed gradients according to (7.6.8).

Example 1  Let h(v) be an n-dimensional standard multinormal distribution

$$h(v) = \frac{1}{(2\pi)^{n/2}} \exp\left( -\frac{1}{2} \sum_{i=1}^{n} v_i^2 \right), \qquad -\infty < v_i < \infty. \tag{7.6.9}$$

Then the smoothed gradient of g(x) is

$$\hat{g}_x(x, \beta) = \frac{1}{2\beta} \int_{-\infty}^{\infty} v\, h(v) \left[ g(x + \beta v) - g(x - \beta v) \right] dv. \tag{7.6.10}$$

Example 2  Let

$$h(v) = \begin{cases} \dfrac{\Gamma(n/2)}{2 \pi^{n/2}}, & \|v\| = 1, \\ 0, & \text{otherwise}, \end{cases} \tag{7.6.11}$$

that is, let the random vector v be uniformly distributed over the surface of
the unit sphere. The smoothed gradient equals

$$\hat{g}_x(x, \beta) = \frac{n}{2\beta} \int_{\|v\| = 1} v\, h(v) \left[ g(x + \beta v) - g(x - \beta v) \right] dv. \tag{7.6.12}$$

Having $\hat{g}_x(x, \beta)$ at our disposal, we can construct, for instance, an
iterative gradient algorithm

$$x_{i+1} = \pi\{ x_i - \alpha\, \hat{g}_x(x_i, \beta_i) \}, \qquad \alpha > 0, \tag{7.6.13}$$

and find the conditions under which $x_i$ converges to x* in the deterministic
optimization problem $\min_{x \in D \subset R^n} g(x) = g(x^*)$, which is a particular case
of (7.1.16'), with $p(w)$ being a Dirac $\delta$ function.

Here $\pi(\cdot)$ denotes the projection operation on D (i.e., for every $x \in R^n$,
$\pi(x) \in D$ and $\|x - \pi(x)\| = \min_{y \in D} \|x - y\|$), and $\alpha$ is a step parameter.

Since g(x) is not a “well behaved” function, the multiple
integrals $\hat{g}_x(x, \beta)$ and $\hat{g}(x, \beta)$ are usually not available in explicit form and
numerical methods have to be used. One of them is, as we know, the
Monte Carlo method. For instance, an estimator of $\hat{g}_x(x, \beta)$ can be found
by the sample-mean Monte Carlo method (see Section 4.2.2),

$$\xi(x, \beta) = \frac{1}{2 N \beta} \sum_{j=1}^{N} \frac{h_v(V_j)}{f(V_j)} \left[ g(x - \beta V_j) - g(x + \beta V_j) \right], \tag{7.6.14}$$

and is called the parametrical statistical gradient (PSG) [18].

Here f(v) is a p.d.f. from which a sample of length N is taken. Assuming
that f(v) = h(v), we obtain, respectively, the PSG in Examples 1 and 2 as

$$\xi(x, \beta) = \frac{1}{2 N \beta} \sum_{j=1}^{N} V_j \left[ g(x + \beta V_j) - g(x - \beta V_j) \right] \tag{7.6.15}$$

and

$$\xi(x, \beta) = \frac{n}{2 N \beta} \sum_{j=1}^{N} V_j \left[ g(x + \beta V_j) - g(x - \beta V_j) \right]. \tag{7.6.16}$$

The r.v.’s in (7.6.15) and (7.6.16) are generated from (7.6.9) and (7.6.11),
respectively.
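A sample-mean estimator of the smoothed gradient with the Gaussian kernel of Example 1 can be sketched as follows. The sketch assumes the estimator has the symmetric-difference form (1/(2Nβ)) Σ V_j [g(x + βV_j) - g(x - βV_j)] with V_j standard normal; the function name and the quadratic test function are illustrative assumptions.

```python
import random


def psg_gaussian(g, x, beta, n_samples, rng):
    """Estimate the smoothed gradient of g at x using standard normal V_j."""
    n = len(x)
    acc = [0.0] * n
    for _ in range(n_samples):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        x_plus = [xi + beta * vi for xi, vi in zip(x, v)]
        x_minus = [xi - beta * vi for xi, vi in zip(x, v)]
        diff = g(x_plus) - g(x_minus)
        for s in range(n):
            acc[s] += v[s] * diff
    scale = 1.0 / (2.0 * n_samples * beta)
    return [scale * a for a in acc]


if __name__ == "__main__":
    g = lambda p: sum(c * c for c in p)      # gradient is 2x
    print(psg_gaussian(g, [1.0, -2.0], 0.5, 20000, random.Random(4)))
```

For a quadratic g the symmetric difference equals 4β⟨x, V⟩, so the estimator is unbiased for the ordinary gradient 2x whatever the value of β; only function values of g are used.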

By analogy with (7.6.7) the smoothed gradient of $\phi(x, W)$ is

$$\frac{1}{2} \int_{-\infty}^{\infty} h_v(v, \beta) \left[ \phi(x - v, W_1) - \phi(x + v, W_2) \right] dv
= \frac{1}{2\beta} \int_{-\infty}^{\infty} h_v(v) \left[ \phi(x - \beta v, W_1) - \phi(x + \beta v, W_2) \right] dv, \tag{7.6.17}$$

and by analogy with (7.6.14) the sample-mean Monte Carlo estimator for
the smoothed gradient of $\phi(x, W)$ is

$$\xi(x, \beta) = \frac{1}{2 N \beta} \sum_{j=1}^{N} \frac{h_v(V_j)}{f(V_j)} \left[ \phi(x - \beta V_j, W_{1j}) - \phi(x + \beta V_j, W_{2j}) \right]. \tag{7.6.18}$$

Assuming f(v) = h(v), by analogy with (7.6.15) and (7.6.16), we have the
PSG for Examples 1 and 2, respectively:

$$\xi(x, \beta) = \frac{1}{2 N \beta} \sum_{j=1}^{N} V_j \left[ \phi(x + \beta V_j, W_{1j}) - \phi(x - \beta V_j, W_{2j}) \right], \tag{7.6.19}$$

$$\xi(x, \beta) = \frac{n}{2 N \beta} \sum_{j=1}^{N} V_j \left[ \phi(x + \beta V_j, W_{1j}) - \phi(x - \beta V_j, W_{2j}) \right]. \tag{7.6.20}$$

From (7.6.18) through (7.6.20) it follows that the estimator $\xi(x, \beta)$ of the
smoothed gradient of $\phi(x, W)$ is constructed on the basis of observations
of $\phi(x, W)$ alone. Both the “artificial” random variable V and the “natural”
random variable W are averaged in these equations. Table 7.6.1 presents
some smoothed gradients and their estimators.

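Assuming that the r.v.'s V and W are mutually independent and taking
the expectation of $\xi(x, \beta)$ with respect to W and V, we obtain

$$E\,\xi(x, \beta) = \frac{1}{N} \sum_{j=1}^{N} E(\xi_j) = \hat{g}_x(x, \beta), \tag{7.6.21}$$

where

$$\xi_j = \frac{1}{2\beta} \frac{h_v(V_j)}{f(V_j)} \left[ \phi(x - \beta V_j, W_{1j}) - \phi(x + \beta V_j, W_{2j}) \right]. \tag{7.6.22}$$

That is, the PSG $\xi(x, \beta)$ is an unbiased estimator for the smoothed
gradient $\hat{g}_x(x, \beta)$. Assuming also the independence of the $W_j$’s, $j = 1, \ldots, N$,
we obtain the variance of the s-th component of $\xi(x, \beta)$:

$$\operatorname{var} \xi_s(x, \beta) = \frac{1}{N^2} \sum_{j=1}^{N} \operatorname{var} \xi_{sj} = \frac{1}{N} \operatorname{var} \xi_{s1}(x, \beta) = N^{-1} \beta^{-2} \sigma_s^2(x), \qquad s = 1, \ldots, n, \tag{7.6.23}$$

where

$$\sigma_s^2(x) = \operatorname{var}\left\{ \frac{1}{2} \frac{h_{vs}(V)}{f(V)} \left[ \phi(x - \beta V, W_1) - \phi(x + \beta V, W_2) \right] \right\} \tag{7.6.24}$$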



[TABLE 7.6.1. Smoothed Gradients and their Estimators]

and

$$E\{ \langle \xi(x, \beta), \xi(x, \beta) \rangle \} = \sum_{s=1}^{n} \operatorname{var} \xi_s(x, \beta) + \langle E[\xi(x, \beta)], E[\xi(x, \beta)] \rangle
\le \sigma^2 N^{-1} \beta^{-2} + \langle \hat{g}_x(x, \beta), \hat{g}_x(x, \beta) \rangle. \tag{7.6.25}$$

Here $h_{vs}(V)$ is the s-th coordinate of the vector $h_v(V)$, $\langle \cdot, \cdot \rangle$ denotes the
scalar product, and

$$\sigma^2 = n^2 \max_{s} \sup_{x \in D} \sigma_s^2(x). \tag{7.6.26}$$

Note that $n^2$ appears in (7.6.26) rather than n because of the covariance
terms.

Taking into account that g(x) is bounded for all $x \in D$ and
$\operatorname{var}_W[\phi(x, W)] < \infty$, we can readily conclude that $\sigma_s^2(x) < \infty$ for all $x \in D$
and therefore $\sigma^2 < \infty$.

Now problem (7.1.16') can be solved by the following algorithm:

$$x_{i+1} = \pi\left( x_i - \alpha\, \xi(x_i, \beta_i) \right). \tag{7.6.27}$$
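The iteration (7.6.27) can be sketched for a box-shaped domain, where the projection reduces to clipping each coordinate. The schedules β_i = i^(-1/4) and N_i = 10 + i, the Gaussian-kernel estimator, the noisy quadratic test function, and all names are assumptions chosen so that β_i^(-2) N_i^(-1) tends to zero; they are not prescribed by the text.

```python
import random


def project_box(x, lo, hi):
    """Projection onto the box [lo, hi]^n (nearest point in Euclidean norm)."""
    return [min(max(c, lo), hi) for c in x]


def psg_noisy(phi, x, beta, n_samples, rng):
    """Gaussian-kernel statistical gradient from noisy observations phi(x, rng)."""
    n = len(x)
    acc = [0.0] * n
    for _ in range(n_samples):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        f_plus = phi([xi + beta * vi for xi, vi in zip(x, v)], rng)
        f_minus = phi([xi - beta * vi for xi, vi in zip(x, v)], rng)
        for s in range(n):
            acc[s] += v[s] * (f_plus - f_minus)
    return [a / (2.0 * n_samples * beta) for a in acc]


def smoothed_descent(phi, x1, lo, hi, alpha, n_iter, rng):
    x = list(x1)
    for i in range(1, n_iter + 1):
        beta_i = i ** -0.25            # smoothing parameter, slowly vanishing
        n_i = 10 + i                   # sample size, slowly growing
        grad = psg_noisy(phi, x, beta_i, n_i, rng)
        x = project_box([a - alpha * b for a, b in zip(x, grad)], lo, hi)
    return x


def demo_phi(x, rng):
    """Noisy observation of g(x) = sum (x_k - 0.5)^2, minimized at (0.5, 0.5)."""
    return sum((c - 0.5) ** 2 for c in x) + rng.gauss(0.0, 0.1)


if __name__ == "__main__":
    print(smoothed_descent(demo_phi, [-1.0, 1.0], -1.0, 1.0, 0.1, 200,
                           random.Random(5)))
```

Each evaluation of `demo_phi` draws fresh noise, which plays the role of the independent W_1j and W_2j in the estimator.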

Theorem 7.6.1  Assume that the iterative process is constructed in
accordance with (7.6.27) and that for every $x \in D$ and for every i the following
conditions are satisfied:

$$\langle x - x^*, \hat{g}_x(x, \beta_i) \rangle \ge K_1 \|x - x^*\|^2 - \gamma_i, \tag{7.6.28}$$

$$\langle \hat{g}_x(x, \beta_i), \hat{g}_x(x, \beta_i) \rangle \le K_2 \|x - x^*\|^2, \tag{7.6.29}$$

$$0 < \alpha < 2 K_1 K_2^{-1}, \tag{7.6.30}$$

$$\lim_{i \to \infty} \beta_i^{-2} N_i^{-1} = 0, \tag{7.6.31}$$

$$\lim_{i \to \infty} \gamma_i = 0, \qquad \gamma_i > 0, \tag{7.6.32}$$

$$E\|x_1\|^2 < \infty, \tag{7.6.33}$$

where $\|x\| = \left( \sum_{k=1}^{n} x_k^2 \right)^{1/2}$ is the norm of x, and $K_1$ and $K_2$ are positive
constants. Then process (7.6.27) converges in the mean square to the point
x*, that is, $\lim_{i \to \infty} E\|x_i - x^*\|^2 = 0$. If we replace condition (7.6.31) by

$$\sum_{i=1}^{\infty} \beta_i^{-2} N_i^{-1} < \infty \tag{7.6.34}$$

and condition (7.6.32) by

$$\sum_{i=1}^{\infty} \gamma_i < \infty, \tag{7.6.35}$$

then process (7.6.27) converges, with probability 1, to x*, that is,
$P\{ \lim_{i \to \infty} \|x_i - x^*\| = 0 \} = 1$.

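Proof  Without loss of generality we can set $x^* = 0$. Taking the
conditional expectation of $\|x_{i+1}\|^2$, given $x_1, \ldots, x_i$, we obtain from (7.6.27)

$$E\{ \|x_{i+1}\|^2 \mid x_1, \ldots, x_i \} \le \|x_i\|^2 - 2\alpha \langle x_i, E[\xi(x_i, \beta_i)] \rangle + \alpha^2 E[\langle \xi(x_i, \beta_i), \xi(x_i, \beta_i) \rangle]. \tag{7.6.36}$$

Substituting $E[\xi(x, \beta)] = \hat{g}_x(x, \beta)$ in (7.6.36), we obtain

$$E\{ \|x_{i+1}\|^2 \mid x_1, \ldots, x_i \} \le \|x_i\|^2 - 2\alpha \langle x_i, \hat{g}_x(x_i, \beta_i) \rangle + \alpha^2 E[\langle \xi(x_i, \beta_i), \xi(x_i, \beta_i) \rangle]. \tag{7.6.37}$$

Now taking (7.6.25) through (7.6.29) into account, we obtain

$$E( \|x_{i+1}\|^2 \mid x_1, \ldots, x_i ) \le \|x_i\|^2 - 2\alpha K_1 \|x_i\|^2 + 2\alpha \gamma_i + \alpha^2 N_i^{-1} \beta_i^{-2} \sigma^2 + \alpha^2 K_2 \|x_i\|^2
= (1 - 2\alpha K_1 + \alpha^2 K_2) \|x_i\|^2 + \alpha^2 N_i^{-1} \beta_i^{-2} \sigma^2 + 2\alpha \gamma_i. \tag{7.6.38}$$

Taking the expectation of both sides of the last inequality and iterating, we obtain

$$E\|x_{i+1}\|^2 \le (1 - 2\alpha K_1 + \alpha^2 K_2) E\|x_i\|^2 + \alpha^2 \beta_i^{-2} N_i^{-1} \sigma^2 + 2\alpha \gamma_i
\le (1 - 2\alpha K_1 + \alpha^2 K_2)^i E\|x_1\|^2 + \sum_{s=1}^{i} (\alpha^2 \beta_s^{-2} N_s^{-1} \sigma^2 + 2\alpha \gamma_s)(1 - 2\alpha K_1 + \alpha^2 K_2)^{i-s}. \tag{7.6.39}$$

It follows from (7.6.30) that $1 - 2\alpha K_1 + \alpha^2 K_2 < 1$; therefore (7.6.39) can be
rewritten as

$$E\|x_{i+1}\|^2 \le K_3^i E\|x_1\|^2 + \sum_{s=1}^{i} (\alpha^2 \beta_s^{-2} N_s^{-1} \sigma^2 + 2\alpha \gamma_s) K_3^{i-s}, \tag{7.6.40}$$

where

$$K_3 = 1 - 2\alpha K_1 + \alpha^2 K_2. \tag{7.6.41}$$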

The first term in (7.6.40) converges to 0 as $i \to \infty$, since $K_3 < 1$ and
$E\|x_1\|^2 < \infty$ (see (7.6.33)). Thus the theorem will be proven if we prove
that

$$\lim_{i \to \infty} \sum_{s=1}^{i} (\alpha^2 \beta_s^{-2} N_s^{-1} \sigma^2 + 2\alpha \gamma_s) K_3^{i-s} = 0.$$

To prove this, for any number $\varepsilon > 0$ we choose a
number T such that, for all $s > T$, $\alpha^2 \beta_s^{-2} N_s^{-1} \sigma^2 + 2\alpha \gamma_s$ is less than $\varepsilon$;
this is possible by (7.6.31) and (7.6.32). Then

$$\sum_{s=1}^{i} (\alpha^2 \beta_s^{-2} N_s^{-1} \sigma^2 + 2\alpha \gamma_s) K_3^{i-s} \le K_3^{i-T} \sum_{s=1}^{T} (\alpha^2 \sigma^2 \beta_s^{-2} N_s^{-1} + 2\alpha \gamma_s) K_3^{T-s} + \varepsilon \sum_{s=T+1}^{i} K_3^{i-s}. \tag{7.6.42}$$

In view of the fact that T is finite, the first term in (7.6.42) tends to zero as
$i \to \infty$, since $K_3 < 1$. Using the formula for a geometrical progression, we
obtain

$$\lim_{i \to \infty} E\|x_{i+1}\|^2 \le \lim_{i \to \infty} \varepsilon \sum_{s=T+1}^{i} K_3^{i-s} = \lim_{i \to \infty} \frac{\varepsilon (1 - K_3^{i-T})}{1 - K_3} = \frac{\varepsilon}{1 - K_3}.$$

Since $\varepsilon$ may be any positive number, we have $\lim_{i \to \infty} E\|x_i\|^2 = 0$.

This completes the proof of the first part of the theorem.

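To prove the convergence of (7.6.27) with probability 1, it is sufficient to
show that $\sum_{i=1}^{\infty} E(\|x_i\|^2) < \infty$. Summing both sides of (7.6.40), we have by
(7.6.34) and (7.6.35)

$$\sum_{i=1}^{\infty} E(\|x_{i+1}\|^2) \le \frac{K_3}{1 - K_3} E(\|x_1\|^2) + \sum_{i=1}^{\infty} \sum_{s=1}^{i} (\alpha^2 \beta_s^{-2} N_s^{-1} \sigma^2 + 2\alpha \gamma_s) K_3^{i-s}$$

$$= \frac{K_3}{1 - K_3} E(\|x_1\|^2) + \sum_{s=1}^{\infty} \sum_{i=s}^{\infty} (\alpha^2 \beta_s^{-2} N_s^{-1} \sigma^2 + 2\alpha \gamma_s) K_3^{i-s}$$

$$= \frac{K_3}{1 - K_3} E(\|x_1\|^2) + \frac{1}{1 - K_3} \sum_{s=1}^{\infty} (\alpha^2 \beta_s^{-2} N_s^{-1} \sigma^2 + 2\alpha \gamma_s) < \infty,$$

from which the result follows. Q.E.D.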


Remark 1  The theorem remains valid for the deterministic optimization
problem

$$\min_{x \in D \subset R^n} g(x) = g(x^*),$$

which is a particular case of problem (7.1.16'), when $W \equiv 0$.

Remark 2  Condition (7.6.28), together with (7.6.32), allows g(x) to be
nonconvex.

APPENDIX

Let Z be a random vector uniformly distributed over the surface of a unit
n-dimensional sphere with its center at the origin, and let R be any given unit
vector issuing from the origin (see Fig. 7.A.1).

[Fig. 7.A.1 Graphical representation of the random vector]

The p.d.f. of the random angle $\varphi$ between Z and R is sought. For reasons
of symmetry we confine ourselves to the semisphere $0 \le \varphi \le \pi$. The p.d.f.
is then [28]:

$$h_n(\varphi) = \frac{\sin^{n-2} \varphi}{\int_0^{\pi} \sin^{n-2} \varphi \, d\varphi} = B_n \sin^{n-2} \varphi, \qquad 0 \le \varphi \le \pi,$$

where

$$B_n = \frac{\Gamma(n/2)}{\sqrt{\pi}\, \Gamma((n - 1)/2)}.$$

The expected value of the r.v. $\varphi$ is

$$E(\varphi) = \int_0^{\pi} \varphi\, h_n(\varphi)\, d\varphi = \frac{\pi}{2},$$

from which it follows that, on the average, R and Z are orthogonal.

It is readily verified that, as n increases, $h_n(\varphi)$ approaches Dirac’s $\delta$
function, that is,

$$\lim_{n \to \infty} h_n(\varphi) = \delta\left( \varphi - \frac{\pi}{2} \right).$$

Fig. 7.A.2 represents $h_n(\varphi)$ for different n.

[Fig. 7.A.2 The density function of $\varphi$ for different n.]
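The two claims of the appendix (mean angle π/2, concentration as n grows) can be checked by simulation. This minimal sketch assumes the standard construction of a uniform point on the sphere by normalizing a vector of independent standard normals, and takes R to be the first coordinate axis; the function names are illustrative.

```python
import math
import random


def random_unit_vector(n, rng):
    """Uniform point on the unit sphere in R^n via a normalized Gaussian vector."""
    while True:
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(c * c for c in z))
        if norm > 0.0:
            return [c / norm for c in z]


def angle_stats(n, n_samples, rng):
    """Sample mean and variance of the angle between Z and R = (1, 0, ..., 0)."""
    angles = []
    for _ in range(n_samples):
        z = random_unit_vector(n, rng)
        cos_phi = max(-1.0, min(1.0, z[0]))     # guard against rounding
        angles.append(math.acos(cos_phi))
    mean = sum(angles) / n_samples
    var = sum((a - mean) ** 2 for a in angles) / n_samples
    return mean, var


if __name__ == "__main__":
    rng = random.Random(6)
    for n in (3, 10, 50):
        print(n, angle_stats(n, 20000, rng))
```

The printed variances shrink roughly like 1/n, which is the concentration around π/2 that the limiting δ function expresses.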





EXERCISES 

1 Find the efficiency $C_n$ (7.2.8) and var(cos $\varphi$) of Algorithm RS-2 analytically and
of Algorithm RS-5 by simulation. For Algorithm RS-5 describe the random
number generator and the flow diagram of your program.

2 Prove that for a linear function g(x) the direction of V iS in Algorithm RS-5 (see 
(7.1.13)) coincides, on the average, with that of the gradient of g(x). 

3 By analogy with algorithm RS-1 (see (7.1.18)) describe the nonlinear tactic 
Algorithm RS-2, the linear tactic Algorithm RS-3, the statistical gradient Algorithm 
RS-5 for solving problem (7.1.16). 

4 Prove that, if g(x) is convex in $R^n$ and if the point x* at which g(x) attains its
minimum value is unique, then $\hat{g}(x, \beta)$ (see (7.6.1)) is strictly convex.

5 Given a linear function $\langle c, x \rangle$ invariant for the convolution (7.6.1), that is,
$\int h(\beta, x - u) \langle c, u \rangle\, du = \langle c, x \rangle$, prove that $\hat{g}(x, \beta) \ge g(x)$.

6 Prove (7.6.4) and (7.6.5). 

7 Prove that, if $h_n(\varphi) = B_n |\sin^{n-2} \varphi|$, $0 \le \varphi \le 2\pi$, then $v = \cos \varphi$ is
distributed according to $P_n(v)$ (see (7.2.12)).

8 Consider the following modification of Algorithm RS-1 (see (7.1.4)): 

*<+ i - + *^[g( x k + - g(x i ))‘E il a, > 0, p t > 0. 

Find the efficiency $C_n$ (7.2.8) and var(cos $\varphi$), assuming that g(x) is a linear
function.